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Abstract 

Three-dimensional models which contain both geometry and texture have numerous appli¬ 
cations such as urban planning, physical simulation, and virtual environments. A major fo¬ 
cus of computer vision (and recently graphics) research is the automatic recovery of three- 
dimensional models from two-dimensional images. After many years of research this goal is 
yet to be achieved. Most practical modeling systems require substantial human input and 
unlike automatic systems are not scalable. 

This thesis presents a novel method for automatically recovering dense surface patches us¬ 
ing large sets (1000’s) of calibrated images taken from arbitrary positions within the scene. 
Physical instruments, such as Global Positioning System (GPS), inertial sensors, and incli¬ 
nometers, are used to estimate the position and orientation of each image. Essentially, the 
problem is to find corresponding points in each of the images. Once a correspondence has been 
established, calculating its three-dimensional position is simply a matter of geometry. Long 
baseline images improve the accuracy. Short baseline images and the large number of images 
greatly simplifies the correspondence problem. Theinitial stage of the algorithm iscompletely 
local and scales linearly with the number of images. Subsequent stages are global in nature, 
exploit geometric constraints, and scale quadratically with the complexity of the underlying 
scene. 

We describe techniques for: 1) detecting and localizing surface patches; 2) refining camera 
calibration estimates and rejecting false positive surfels; and 3) grouping surface patches into 
surfaces and growing the surface along a two-dimensional manifold. We also discuss a method 
for producing high quality, textured three-dimensional models from these surfaces. Someofthe 
most important characteristics of this approach are that it: 1) uses and refines noisy calibra¬ 
tion estimates; 2) compensates for large variations in illumination; 3) tolerates significant soft 
occlusion (e.g. tree branches); and 4) associates, at a fundamental level, an estimated normal 
(eliminating the frontal-planar assumption) and texture with each surface patch. 
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Chapter 1 
Introduction 


Three-dimensional models which contain both geometry and texture have nu¬ 
merous appl ications. They are used extensively as vi rtual envi ronments for en- 
tertainment (e.g. games, videos, commercials and movies). Three-dimensional 
models also have serious applications such as physical simulation and urban 
planning. Recent advances in hardware and software have improved our abil¬ 
ity to store, navigate and display large, complex models. However, model con¬ 
struction is frequently a tedious and time consuming task requiring significant 
human input. As the size, complexity, and realism of models increase, man¬ 
ual, and even interactive or semi-automatic, model construction will rapidly 
become impractical. 

This thesis presents a novel method for automatically recovering dense sur¬ 
face patches using large sets (1000's) of calibrated images taken from arbitrary 
positions within the scene. Physical instruments, such as Global Positioning 
System (GPS), inertial sensors, and inclinometers, are used to estimate the po¬ 
sition and orientation of each image. The large set of images acquired from a 
wide range of viewpoints aids in identifying correspondences and enables accu¬ 
rate computation of their three-dimensional position. We describe techniques 
for: 


• Detecting and localizing surface patches. 

• Refining camera calibration estimates and rejecting false positive surfels. 

• Grouping surface patches into surfaces. 

• Growing surfaces along a two-dimensional manifold. 

In addition, we also discuss a method for producing high quality, textured 
three-dimensional models from these surfaces. The initial stage of the al¬ 
gorithm is completely local making it easily parallelizable. Some of our ap¬ 
proach's most important characteristics are: 

• It is fully automatic. 

• It uses and refines noisy calibration estimates. 
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• It compensates for large variations in illumination. 

• It matches image data directly in three-dimensional space. 

• It tolerates significant soft occlusion (e.g. tree branches). 

• It associates, at a fundamental level, an estimated normal (eliminating 
thefrontal-planar assumption) and texture with each surface patch. 



Figure 1-1: Albrecht Diir er, Artist Drawing a Lute, 1525. 


1.1 Background 

The basics of perspective image for mat ion have been known for more than 2000 
years and date back to Pythagoras, Euclid, and Ptolemy. In the 15 th century, 
Leono Alberti published thefirst treatise on perspective and later that century 
Leonardo da Vinci studied perspective projection and depth perception. By the 
16 th century these concepts were well known to artists (e.g. Figure 1-1). In the 
18 th century J ohan Heinrich Lambert developed a technique cal led space resec¬ 
tion to find the point in space from which a picture was made. Through the end 
of the 18 th century, the study of perspective focused exclusively on image for¬ 
mation. The main motivation was accurately depicting the three-dimensional 
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world on a two-dimensional canvas. Early in the 19 th century the photograph 
was invented and the natural question followed: Can two-dimensional pho¬ 
tographs be used to deducethethree-dimensional world which produced them ? 

The problem of recovering three-dimensional information from a set of pho¬ 
tographs or images is essentially the correspondence problem: Given a point in 
one image find the corresponding point in each of the other images. Typically, 
photogrammetric approaches (Section 1.1.1) require manual identification of 
correspondences, while computer vision approaches (Section 1.1.2) rely on au¬ 
tomatic identification of correspondences. If the images are from nearby po¬ 
sitions and similar orientations (short baseline), they often vary only slightly, 
simplifying the identification of correspondences. Once sufficient correspon¬ 
dences have been identified, solving for the depth is simply a matter of geom¬ 
etry. This applies to both calibrated and uncalibrated images. For calibrated 
images (known internal calibration, camera position, and orientation) a sin¬ 
gle correspondence is sufficient to triangulate the three-dimensional position 
of the scene point which gave rise to the corresponding image points. The un¬ 
calibrated case is more complicated requiring additional correspondences to 
recover the calibration as well as the three-dimensional information. 



Figure 1-2: Calculating Three-Dimensional Position. 

Real images are composed of noisy, discrete samples, therefore the calcu¬ 
lated three-dimensional location of a correspondence will contain error. This 
error is a function of the baseline or distance between the images. Figure 1- 
2 shows how the location of point P can be calculated given two images taken 
from known cameras Ci and C 2 and corresponding pointsp 1 andp 2 within those 
images, which are projections of P. The exact location of p 1 in the image is un- 
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certain, as a result P can lie anywhere within the left cone. A similar situation 
exists for p 2 . If p 1 and p 2 are corresponding points, then P could lie anywhere 
in the shaded region. Clearly, for this situation increasing the baseline be¬ 
tween Ci and C 2 will reduce the uncertainty in the location of P. This leads 
to a conflict: short baselines simplify the identification of correspondences but 
produce inaccurate results; long baselines produce accurate results but compli¬ 
cate the identification of correspondences. Real images also frequently contain 
occlusion and significant variations in illumination both of which complicate 
the matching process. 

1.1.1 Photogrammetry 



Figure 1-3: 19 th Century Stereoscope. 

About 1840, Dominique FrangoisJ ean Arago, the French geodesist, advo¬ 
cated the use of photography by topographers. A decade later, Ai me Laussedat, 
a Colonel in the French Army Corps of Engineers, set out to prove that photog¬ 
raphy could be used to prepare topographic maps. Over the next 50 years, his 
work was so complete that Aime Laussedat is often referred to as the father 
of photogrammetry. Prior to the advent of computing, analog techniques were 
used to physically reproject single images and stereo pairs. The stereoscopic 
viewer, shown in Figure 1-3, is one such reprojection device. The availability 
of computing in the mid-20 th century enabled the introduction of analytical 
techniques. For the first time multiple (more than two) photographs could be 
analyzed simultaneously [Wolf, 1974, Slama dtai, 1980]. 

In spite of the fact that it originated with ground based imagery, modern 
photogrammetry uses long-range aerial imagery almost exclusively. This is 
in contrast with the close-range ground base imagery used as input for this 
thesis. For accurate results, photogrammetry requires good camera calibra- 
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tion. The internal parameters 1 are measured in a controlled environment 
and traditionally the external parameters are estimated using points whose 
three-dimensional position is known or tie-points. Classical photogrammetry 
requires a large amount of human input, typically in the form of identifying 
corresponding features in multiple photographs. The correspondences are then 
used to recalculate the external parameters, as well as determine the three- 
dimensional position of selected features. Two classes of techniques are used 
to reestimate the external parameters. Closed-form analytic techniques re¬ 
quire fewer correspondences and do not need an initial estimate, but tend to 
be numerically unstableand sensitive to error in the data. Global optimization 
techniques, such as bundle-adjustment, require both an initial estimate and 
many correspondences, but generally produce more stable results. The cali¬ 
bration updates described in Section 5.1 fall into the latter category. Recently, 
automated photogrammetric methods have been explored [Greeve, 1996]. 

A number of the recent interactive modeling systems are based upon pho¬ 
togrammetry 2 . Research projects such as RADIUS [Collins et al., 1995] and 
commercial systems such as FotoG [Vexcel, 1997] are commonly used to ex¬ 
tract three-dimensional models from images. Good results have been achieved 
with these systems, however the requirement for human input limits the size 
and complexity of the recovered model. One approach to reducing the amount 
of human input is to exploit geometric constraints. The geometric structure 
typical of urban environments can be used to constrain the modeling process. 
Becker and Bove[1995] manually identify groups of parallel and perpendicular 
lines across small sets of images. Shum et al. [1998] interactively draw points, 
lines and planes onto a few panoramic mosaics. Debevec etal.'s Facade system 
[1996] uses a small set of building blocks (cube, cylinder, etc.) to limit the set 
of possible models from a small set of images (at most a couple dozen). The 
major strength of these systems is the textured three-dimensional model pro¬ 
duced. Debevec et al. cite a hundred fold decrease in human input using the 
block-based approach. Despite this reduction, each image must be processed 
individually by a human to produce a three-dimensional model, making it dif¬ 
ficult to extend these systems to large sets of i mages. 

1.1.2 C omputer Vi si on 

The field of computer vision began with the advent of computing in the mid- 
20 th century and in many ways developed in parallel to, but separate from 
photogrammetry [FIorn, 1986, Mayhew and Frisby, 1991, Faugeras, 1993]. A 
major focus of computer vision is the automatic recovery of three-dimensional 

1 See Appendix A for a discussion of internal and external parameters. 

2 Some might consider these to be in the field of computer vision. We have placed them in 
this section because, liketraditional photogrammetric methods, they require human input. 
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information from two-dimensional images. Both calibrated and uncalibrated 
images can be used and as noted above, recovering three-dimensional informa¬ 
tion can be reduced to finding correspondences 

Uncalibrated Imagery 

Several researchers [Longuet-Higgins, 1981, Horn, 1987, Mohr and Arbogast, 
1991, Hartley, 1992, Hartley dt a/., 1992] have shown that the relative cam¬ 
era orientations 3 and the three-dimensional structure of the scene can be re¬ 
covered from images with known internal calibration and unknown external 
calibration. If the internal calibration is also unknown, Faugeras [1992] has 
shown that the scene structure can only be determined up to an unknown pro¬ 
jective transformation. Structure from motion and direct motion estimation 
are two popular approaches which use uncalibrated imagery. Structure from 
motion techniques [Ullman, 1978, Ullman, 1979, Tomasi and Kanade, 1992, 
Azarbayejani and Pentland, 1995, Seales and Faugeras, 1995] identify and 
track high-level features (e.g. edges) across several images. Because both 
structure and motion are recovered, a large set of features must be tracked 
for a robust solution. Direct motion estimation techniques [Mohr eta/., 1993, 
Szeliski and Kang, 1994, Ayer and Sawhney, 1995, Adel son and Weiss, 1996, 

I rani et a!., 1997] use the optical flow of each pixel across images to directly 
estimate structure and motion. Computing optical flow tends to be sensitive 
to changes in illumination and occlusion. The need to track features or calcu¬ 
late optical flow across images implies that adjacent images must be close. The 
cost of acquiring such an image set rapidly becomes prohibitive for large-scale 
models. 

Calibrated Imagery 

A popular set of approaches for automatically finding correspondences from 
calibrated images is relaxation techniques [Marr and Poggio, 1979]. These 
methods are generally used on a pair of images; start with an educated guess 
for the correspondences; then update them by propagating constraints. These 
techniques often exploit global constraints such as smoothness [Pollard et a!., 
1985] and ordering [Ohta and Kanade, 1985]. They don't always converge and 
don't always recover the correct correspondences. Utilizing only one pair of 
images at a time has several disadvantages. 

• The trade off between easily identifying correspondences and accurate 
results (discussed above) means that these methods generally produce 
less accurate results. 


External calibration in an unknown coordinate frame. 
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• The results from one pair are restricted to the regions visible in both im¬ 
ages. To reconstruct a large volume many pairs are required, leading to 
the difficult problem of integrating the results from multiple pairs. 

Another approach is to use multiple images. Several researchers, such as 
Yachida [1986], have proposed trinocular stereo algorithms and some have 
proposed using a third image to check correspondences [Faugeras and Robert, 
1994]. Others have also used special camera configurations to aid in the cor¬ 
respondence problem, [Tsai, 1983, Bolles eta/., 1987, Okutomi and Kanade, 
1993]. Bolles eta/. [1987] proposed constructing an epi polar-plane image from 
a large number of images. In some cases, analyzing the epi polar-plane image 
is much simpler than analyzing the original set of images. The epi polar-plane 
image, however, is only defined for a limited set of camera positions. Baker and 
Bolles [1989] have also attempted to extend the limited set of usable camera 
positions. Tsai [1983] and Okutomi and Kanade [Okutomi and Kanade, 1993, 
Kanade and Okutomi, 1994] defined a cost function which was applied directly 
to a set of images. The extremum of this cost function was then taken as the 
correct correspondence. Occlusion is assumed to be negligible. I n fact, Okutomi 
and Kanade state that they "invariably obtained better results by using rela¬ 
tively short baselines." This is likely the result of using an image space match¬ 
ing metric (a correlation window) and ignoring perspective distortion and oc¬ 
clusion. Both methods use small sets of images, typically about ten. They also 
limit camera positions to special configurations. Tsai uses a localized planar 
configuration with parallel optical axes; and Okutomi and Kanade use short 
linear configurations. 

Some recent techniques use multiple images which are not restricted to 
rigid camera configurations. Kang et al. [1995] use structured light to aid in 
identifying correspondences. Kang and Szeliski [1996] track a large number of 
features across a small number of closely spaced (~ 3 inches) panoramic images 
(360° field of view). Neither of these approaches is well suited to ext ended urban 
environments. Structured lighting is difficult to use outdoors and tracking a 
dense set of features across multiple large datasets is difficult at best. 

Collins [1996] proposed a space-sweep method which performs matching in 
three-dimensional space instead of image space. There are several advantages 
to matching in three-dimensional space. It naturally operates on multiple im¬ 
ages and can properly handle perspective distortion and occlusion, improving 
the matching process. Col I ins'formulation assumes that there is a singleplane 
which separates all of the cameras from the scene. As the plane is swept 
away from the cameras, features from all of the images are projected onto it. 
Sweep plane locations which have more than a minimum number of features 
are retained. The resulting output is a sparse set of three-dimensional points. 
Collins demonstrates his method on a set of seven aerial photographs. Seitz 
and Dyer [1997] proposed a voxel coloring method which also performs match- 
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ing in three-dimensional space. Voxels are reconstructed in a special order such 
that if point P occludes point Q for any image then P is reconstructed before Q. 
To meet this ordering requirement, no scene points areallowed within thecon- 
vex hull of the cameras. Collins'sweep-plane is a special case of this ordering 
requirement. Neither of these approaches is suitable for reconstructions using 
images acquired from within the scene. 

Kutulakos and Seitz [1998b, 1998a] have proposed an extension of voxel 
coloring called space carving. The separability requirement is removed. An 
initial estimate of the reconstructed volume is required and must be a rela¬ 
tively tight superset of theactual volume. Voxels aretested in order similar to 
Seitz and Dyer. Background pixels are removed and matching is performed on 
raw pixel values. If the projections of a voxel into the images are not consistent, 
the voxel is carved away. The reconstruction is complete when no more voxels 
can be removed. The carving operation is brittle. If a single image disagrees 
the voxel is removed, making this approach particularly sensitive to camera 
calibration errors, illumination changes, complex reflectance functions, image 
noise, and temporal variations. In short, it is not well suited to outdoor urban 
environments. 

While computer vision techniques eliminate the need for human input in 
general they have several characteristics which limit their applicability in cre¬ 
ating large, complex, and realistic three-dimensional models. 

• They are frequently fragile with respect to occlusion and variations in 
illumination. 

• They frequently operate on only a few images and do not scale to large 
sets of i mages. 

• The output (e.g. depth maps and isolated edges) aregenerally not directly 
useful as a model. 

Recently, the last item has been partially addressed by Fua [Fua, 1995, Fua 
and Leclerc, 1995, Fua and Leclerc, 1996] who starts with a set of three- 
dimensional points and fits small surface elements to the data. Fitting is ac¬ 
complished by optimizing an image-based objective function. This is in contrast 
with the approach presented in this thesis which recovers surface elements di¬ 
rectly from the image data. In addition, as presented Fua's approach does not 
have the ability to fill in holes in the three-dimensional points used as input. 

Finally, several researchers have used probability theory to aid in recov¬ 
ering three-dimensional information. Co xetal. [Cox, 1994, Cox et ah, 1996] 
proposed a maximum-likelihood framework for stereo pairs, which they have 
extended to multi pie images. This work attempts to explicitly model occlusions, 
although in a somewhat ad hoc manner. Belhumeur [Belhumeur and Mumford, 
1992, Belhumeur, 1993, Belhumeur, 1996] develops a detailed Bayesian model 
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of image formation and the structure of the world. These two approaches serve 
as the inspiration for the probabilistic elements of this thesis. 

1.1.3 Discussion 

Photogrammetric methods produce high quality models, but require human 
input. Computer vision techniques function automatically, but generally do 
not produce usable models, operate on small sets of images and frequently 
are fragile with respect to occlusion and changes in illumination. The work 
presented in this thesis draws from both photogrammetry and computer vi¬ 
sion. Like photogrammetric methods we produce high quality textured models 
and like computer vision techniques our method is fully automatic. The major 
inspiration derives from Bolles et al. [Bolles et at., 1987, Baker and Bolles, 
1989]. We define a construct called an epipolar image and use it to analyze 
evidence about three-dimensional position and orientation. LikeTsai [1983] 
and Okutomi and KanadetOkutomi and Kanade, 1993, Kanadeand Okutomi, 
1994] we define a cost function that is applied across multiple images, how¬ 
ever, we do not evaluate the cost function in image space. Instead, I ike Col I ins 
[1996], Seitz and Dyer [1997], and Kutulakos and Seitz [Kutulakos and Seitz, 
1998b, Kutulakos and Seitz, 1998a] we perform matching in three-dimensional 
space. We also model occlusion and the imaging process similar to Cox et 
al. [1996] and Belhumeur [Belhumeur and Mumford, 1992, Belhumeur, 1993, 
Belhumeur, 1996]. 

There are also several i mportant differences from previous work. The epi po¬ 
lar image we define is valid for arbitrary camera positions within the scene and 
is capable of analyzing very large sets of images. Our focus is recovering built 
geometry (architectural facades) in an urban environment. However, the al¬ 
gorithms presented are generally applicable to objects that can be modeled by 
small planar patches. Surface patches (geometry and texture) or surfels are 
recovered directly from the image data. I n most cases, three-dimensional posi¬ 
tion and orientation can be recovered using purely local information, avoiding 
the computational costs of global constraints. Some of the significant charac¬ 
teristics of this approach are: 

• Large sets of images contain both long and short baseline images and ex¬ 
hibit the benefits of both (accuracy and ease of matching). It also makes 
our method robust to sensor noise and occlusion, and provides the infor¬ 
mation content required to construct complex models. 

• Each image is calibrated - its position and orientation in a single global 
coordinate system is estimated. The use of a global coordinate system 
allows data to be easily merged and facilitates geometric constraints. 
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• Thefundamental unit is a textured surface patch and matching is done in 
three-dimensional space. This eliminates the need for the frontal-planar 
assumption made by many computer vision techniques and provides ro¬ 
bustness to soft occlusion (eg. tree branches). Also, the surface patches 
are immediately useful as a coarse model and readily lend themselves to 
aggregation to form more refined models. 

• The algorithm tolerates significant noise in the calibration estimates and 
produces updates to those estimates. 

• Thealgorithm corrects for changes in illumination. This allows raw image 
properties (eg. pixel values) to be used avoiding the errors and ambigui¬ 
ties associated with higher level constructs such as edges or corners. 

• Thealgorithm scales well. The initial stage is completely local and scales 
linearly with the number of images. Subsequent stages are global in na¬ 
ture, exploit geometric constraints, and scale quadratical ly with the com¬ 
plexity of the underlying scene. 4 

As noted above, not all of these characteristics are unique, but their combi¬ 
nation produces a novel method of automatically recovering three-dimensional 
geometry and texture from large sets of images. 


1.2 City Scanning Project 

This thesis is part of the City Scanning project whose primary focus is the 
Automatic Population of Geospatial Databases (APGD). As its name implies, 
the main target of the City Scanning project is urban environments such as 
shown in Figure 1-4. In addition to the one presented in this thesis, other 
approaches to three-dimensional reconstruction have been explored as part of 
the City Scanning project. Coorg [1998] hypothesizes large (building size) ver¬ 
tical faces from sparse edge information and Chou [1998] attempts to deduce 
faces by aggregating sparse three-dimensional features. I n contrast with these 
approaches, this thesis performs dense reconstruction directly from the image 
data. Another major focus of the project is the interactive navigation and ren¬ 
dering of city sized databases. The following sections describe the acquisition 
and preprocessing phases which produce the calibrated images used as input 
for the algorithms presented in this thesis. For a more complete description of 
the project see [Teller, 1998a, Teller, 1998b]. 

4 This is the worst case complexity. With spatial hashing the expected complexity is linear 
in the number of reconstructed surfels. 
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Figure 1-4: Goal of the city project. Rendering from an idealized model built 
with human input. 
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Figure 1-5: Argus. 



Figure 1-6: FIemispherical tiling for a node. 
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1.2.1 Node Acquisition 

A self contained platform called Argus (Figure 1-5) continues to be developed 
to acquire calibrated images[De Couto, 1998]. At the center of Argus is a 
precision computer controlled pan-tilt head upon which a digital camera is 
mounted. The camera is initially calibrated using Tsai's method [Tsai, 1983, 
Tsai, 1987] and is configured to rotate about its center of projection. At each 
node images are acquired in a hemispherical tiling similar to that shown in 
Figure 1-6. Argus is also equipped with a number of navigational sensors in¬ 
cluding GPS, inertial sensors, odometry and inclinometers which enable it to 
estimate the position and orientation of each image. Currently node positions 
and orientations are also surveyed by hand which serves as validation. The 
raw image data, exposure parameters, and absolute position and orientation 
estimates are recorded by an on-board computer. 



Figure 1-7: Spherical Mosaic for a node. 


1.2.2 Mosaicing and Registration 

The position and orientation estimates obtained during the acquisition phase 
are good, but contain a significant amount of error. Refining these estimates 
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has two parts: mosaicing and registration [Coorg et al., 1998, Coorg, 1998]. 
Mosaicing exploits the constraint that all of the images in a given node share 
the same center of projection and form (somewhat more than) a hemisphere. 
The relative orientations of a node are optimized by correlating the overlap¬ 
ping regions of adjacent images. A sample of the spherical mosaic produced is 
shown in Figure 1-7. Once mosaicing is completed a node can be treated as 
a single image with a single position and orientation. Registration identifies 
a few prominent features per node and performs a global optimization which 
refines the location and orientation of each node. 


1.3 Thesis Overview 

Figure 1-8 shows an overview of the reconstruction pipeline described in this 
thesis. The calibrated imagery described in the last section serves as input to 
the pipeline. The left hand column shows the major steps of our approach; the 
right hand side shows example output at various stages. The output of the 
pipeline is a textured three-dimensional model. Our approach can be generally 
characterized as hypothesize and test. We hypothesize a surfel and then test 
whether it is consistent with the data. Chapter 2 presents our basic approach. 
To simplify the presentation in this chapter we assume perfect data (i.e. perfect 
calibration, constant illumination, diffuse surfaces, and no occlusion or image 
noise) and use a synthetic dataset to demonstrate the algorithm. Chapter 3 
extends the theory to cover camera calibration error, variations in illumina¬ 
tion, occlusion, and image noise. Chapter 4 explores detecting and localizing 
surfels. We demonstrate the ability of our algorithm to compensate for cam¬ 
era calibration error, variations in illumination, occlusion, and image noise. 
We examine its ability to detect and localize three-dimensional surfaces from 
a significant distance (both in position and orientation) using a purely local 
algorithm. Noisy data causes some true positives to be missed and compensat¬ 
ing for noisy data increases the number of false positives recovered. Chapter 5 
introduces several techniques to remove false positives and fill in the missing 
parts. Camera updates and grouping surfels into surface impose global con¬ 
straints which remove nearly all of the false positives. Growing surfaces fills 
in most of the reconstruction. Finally we discuss fitting simple models and ex¬ 
tracting textures. We present the algorithms as well as the results of applying 
them to a large dataset. 


1.3.1 Definitions 

Some of the more general or fundamental terms used throughout this thesis 
are defined below. Others will be defined as needed. 
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Figure 1-8: Overview of Thesis. 
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• Internal Parameters: The camera parameters which govern the image 
formation process. The exact parameters depend on the camera model 
used, but usually include: focal length, aspect ratio, and principle point. 
See Appendix A for a more complete description. 

• External Parameters: The camera parameters which govern what, in 
an absolute coordinate system, is imaged - the location of the center of 
projection and the orientation of the optical axis. See Appendix A for a 
more complete description. 

• Calibrated Image: An image for which both the internal and external 
calibration are known. Also referred to as a pose image. 

• Node: A set of images acquired from the same location. In other words, 
all of the images in a node share the same center of projection. The 
images of a node typically have orientations which tile a hemisphere or 
moretCoorg etal., 1998]. 

• Surfel: A small planar surface patch or surface dement. This definition 
differs from Szeliski's [Szeliski and Tonnesen, 1992] in that it refers to a 
finite sized patch which includes both geometry and texture. 

• Units: All of the datasets used in this thesis share a common global coor¬ 
dinate system in which 1 unit equals 1/10 foot. 

The pin-hole camera model is used throughout this thesis. As noted in Ap¬ 
pendix A it is a linear model and is not capable of modeling nonlinear dis¬ 
tortion. Images which have large amounts of distortion can be unwarped as 
a preprocessing step to remove the nonlinear portions, however this was not 
necessary for the datasets used i n thi s thesi s. 



Chapter 2 

The Basic Approach 


In this chapter we describe our basic approach. To simplify the presentation 
we assume perfect data (i.e. perfect calibration, constant illumination, diffuse 
surfaces, and no occlusion or image noise). These assumptions are relaxed in 
the following chapter. The approach presented here was initially inspired by 
the work of Bolles eta/. [1987]. The notation used in this chapter is defined in 
Table 2.1 and Section 2.1 reviews epipolar geometry and epi polar-plane images. 
Wethen definea construct called an epi polar image (Section 2.2) and show how 
it can be used to reconstruct three-dimensional information from large sets of 
images (Section 2.3). LikeTsai [1983] and Okutomi and Kanade[1993] we de¬ 
fine a cost function that is applied across multiple images, however, we do not 
evaluate the cost function in image space. Instead, like Collins [1996], Seitz 
and Dyer [1997], and Kutulakos and Seitz [1998b, 1998a] we perform match¬ 
ing in three-dimensional space. We also model occlusion and the imaging pro¬ 
cess similar to Cox [1996] and Belhumeur [Belhumeur and Mumford, 1992, 
Belhumeur, 1993, Belhumeur, 1996]. There are several important differences, 
however. An epi polar image is valid for arbitrary camera positions and over¬ 
comes some forms of occlusion. Where three-dimensional information cannot 
be recovered using purely local information, the evidence from the epipolar 
image provides a principled distribution for use in a maximum-likelihood ap¬ 
proach (Section 2.4) [Duda and Hart, 1973]. Finally, we present the results of 
using epipolar images toanalyze a set of synthetic images (Section 2.5). 


2.1 Epipolar Geometry 

Epipolar geometry provides a powerful constraint for identifying correspon¬ 
dences. Given two cameras with known centers Ci and C 2 and a point P in the 
world, the epipolar plane IT is defined as shown in Figure 2-1. P projects to 
p 1 and p 2 on image planes III and n 2 respectively. The projection of IT ontonj 
and n 2 produce epipolar lines t\ and t\. This is the essence of the epi polar con¬ 
straint. Given any point p on epipolar line^ in imagenj, if the corresponding 
point is visible in image II 2 , then it must Neon the epi polar line 
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Pi 

Absolute coordinates of thej th surfel. 

n i 

Orientation of thej th surfel. 

Q 

Center of projection for the i th camera. 

nj 

Image plane for the i th camera. 

r 

Calibrated image. P = (n?,Q). 

I 

Set of calibrated images. {P}. 

Pi 

Image point. Projection of P, ontonj. 

He 

The k th epipolar plane. 

fk,i 

E pi polar line. Projection of onton*. 

p* 

Base image point. Any point in any image. 

P 

Set of projections of P i( |p*}. 

p 11 

Set of projections of P^ for which the projection is in the 
front half space, jp* QP* • n < 0 j. 

c* 

Base camera center. Camera center associated with p*. 

c 

Set of camera centers, {Q}. M ay or may not include C* 

nr 

Base image. Containsp*. 

n 

Set of image planes, {PI*}. May or may not includen* 

t 

Base line. 3D line passing through p* and C*. 

f*,i 

Epi polar line. Projection of t ontonj. 

** 

e 

Set of epi polar lines, {t/}. 

SPk 

Epi polar plane image. Constructed using n£. 

E x 

Epi polar image constructed using p*. x indexes all p*'s. 

T(P,I) 

Function which projects P ontoI, e.g. T(P j ,l i ) = p*. 

V(P,n,I) 

Ideal image values (no noise or occlusion) at T(P,I). 


Function of the image at points (e.g. pixel values, 
features, etc). 


Matching function. Quality of match between x 1 andx 2 . 

a ) 

Match quality. Used to analyzed. Also//(P i ,n i ) and z/(Sj). 

d(p 4 ) 

Depth at image point p\ Distance from Q to the closest 

actual surface in the direction Qp\ If low confidence 
or unknown, then 00 . 

o(pj-) 

Orientation at image point p*. Orientation of to the 

J - $ 

closest actual surface in thedirection Qp*. 

G(/i, ex 2 , x) 

Gaussian distribution with mean /j and variance o 2 
evaluated at x. 

{E|G} 

Set of all E's such that C is true. 

P(PIC') 

Probability of P conditioned on C. 

P 1 P 2 

Unit vector in thedirection from P 1 toP 2 . 

M; 

Modeled object. Previously reconstructed. 


Table 2.1: Notation used for the basic approach. 
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Figure2-1: E pi polar geometry. 


Bolles et al. [Bolles et at., 1987] used the epipolar constraint to construct 
a special image which they called an epi polar-plane image. As noted earlier, 
an epi polar line 4 contains all of the information about the epi polar plane lie 
that is present in thez th image n*. An epi polar-plane image is built using all of 
the epipolar lines 4 ! (the boldface symbol denotes a set, i.e. {44}) from a set 
of images II, which correspond to a particular epipolar plane 1^ (Figure 2-2). 
Since all ofthelines^ in an epi polar-plane image SP k are projections of the 
same epi polar plane n^,for any given pointp in SP k , if the corresponding point 
in any other image nj is visible, then it will also be included in SR. Bolles, 
Baker and Marimont exploited this property to solve the correspondence prob¬ 
lem for several special cases of camera motion. For example, with images taken 
at equally spaced points along a linear path perpendicular to the optical axes, 
corresponding points form lines in the epi polar-plane image; therefore finding 
correspondences reduces to finding lines in the epi polar-plane image. 

For a given epipolar planer^, only those images whose camera centers lie 
on 14 ({Cj Cj • 14 = 0 }) can be included in epi polar-plane image SR. For ex¬ 
ample, using a set of images whose camera centers Neon a plane, an epi polar- 
plane image can only be constructed for the epi polar plane containing thecam- 
era centers. In other words, only a single epipolar line from each image can be 
analyzed using an epi polar-plane image. In order to analyze all of the points in 
a set of images using epi polar-plane images, all of the camera centers must be 
collinear. This can be a serious limitation. 

2.2 E pipolar I mages 

For our analysis we define an epipolar image S which is a function of a set of 
images and a point in one of those images. An epipolar image is similar to 
an epi polar-plane image, but has one critical difference that ensures it can be 
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constructed for every pixel in an arbitrary set of images. Rather than use pro¬ 
jections of a single epipolar plane, we construct the epipolar image from the 
pencil of epi polar planes defined by the line t through one of the camera cen¬ 
ters C* and one of the pixels p* in that imagen* (Figure 2-3). n* is the epi polar 
plane formed by t and theT h camera center Q. Epi polar line 4 contains all 
of the information about t present inn?. An epi polar-plane image is composed 
of projections of a plane; an epi polar image is composed of projections of a line. 
The cost of guaranteeing an epi polar image can be constructed for every pixel is 
that correspondence information is accumulated for only one point 1 p*, instead 
of for an entire epi polar line. 



Figure 2-4: Set of points which form a possible correspondence. 

To simplify the analysis of an epi polar image we can group points from the 
epipolar lines according to possible correspondences (Figure 2-4). Pi projects 
to pi in nj; thereforep! ({pi}) has all of the information contained inH about 
Pi. There is also a distinct set of pointsp 2 for P 2 ; thereforep, contains all of 
the possible correspondences for P^. If Pj is a point on the surface of a physical 
object and it is visible in both n? and n*, then measurements taken at p{ should 
(under thesimpleassumptions of this chapter) match those taken at p* (Figure 
2-5). Conversely, if Pj is not a point on the surface of a physical object then the 
measurements taken atp* are unlikely to match those taken atp* (Figures 2-6 
and 2-7). Epipolar images can be viewed as tools for accumulating evidence 
about the possible correspondences of p*. A simple function of j is used to build 

'To simplify the presentation in this chapter the discussion will focus on points, however 
(oriented) points and surfels are interchangeable. 
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an ordered set so that {P.,} is a set of points along t at increasing depths from 
the image plane. 




Figure 2-5: Pj is a point Figure2-6: Occlusion be- Figure 2-7: Inconsistent 
on a physical object. tween C* and P j. background 


2.3 Basic Algorithm 

An epipolar imaged is constructed by organizing 

j:r(p}) |J r (-) is a function of the image} 

into a two-dimensional array with i and j as the vertical and horizontal axes 
respectively. Rows in £ are epi polar lines from different images; columns form 
sets of possible correspondences ordered by depth 2 (Figure 2-8). The quality 
u(j) of the match 3 between column j and p* can bethought of as evidence that 
p* is the projection ofP,, andj is its depth. Specifically: 

■'« = £*(*■( p;),^(p*)), (2.D 

i 

where T’(-) is a function of the image and X(-) is a function which measures 
how well Tip)) matches ^(p*). A simple case is 

T(x) = intensity values at x 

and 

X(xi,x 2 ) = -||xi - X 2 \\ 2 . 

2 The depth of P ; or distance from C* to can be trivially calculated from j, therefore we 
consider j and depth to be interchangeable. We further consider depth and three-dimensional 
position to be equivalent since we use calibrated images and the three-dimensional position 
can be trivially calculated fromp* and the depth. 

3 I nterestingly, u(j) is related to the tomographic algorithm [Ramachandran and Lakshmi- 
narayanan, 1971, Geringand Wells, 1999], 
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£(i,i*,p*) 


i (image) 


Figure 2-8: Constructing an epipolar image. 

Real cameras are finite, and p) may not be contained in the i mage nf (p* ^ u(\. 
Only terms for which p* e n* should be included in (2.1). To correct for this, 
v(j) is normalized, giving: 


i '(j) 


E x (nv)),Hv')) 

i | pjen? 

i | pjer v 


( 2 . 2 ) 


Ideally, u(j) will havea sharp, distinct peak atthecorrect depth, so that 


argmax(z/(j)) = the correct depth at p*. 

j 

As the number of elements in p ; increases, the likelihood increases that v(j) 
will be large when lies on a physical surface and small when it does not. 
Occlusions do not produce peaks at incorrect depths or false positive^. They 
can however, cause false negatives or the absence of a peak at the correct depth 
(Figure 2-9). A false negative is essentially a lack of evidence about the correct 
depth. Occlusions can reduce the height of a peak, but a dearth of concurring 


4 Except possibly in adversarial settings. 
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images is required to eliminate the peak. Globally this produces holes in the 
data. While less then ideal, this is not a major issue and can be addressed in 
two ways: removing the contribution of occluded views, and adding unoccluded 
views by acquiring more images. 


A 

* 



Figure 2-9: False negative caused by occlusion. 



Figure 2-10: Exclusion region (grey) for surfel located at P, with normal n,. 

A large class of occluded views can be eliminated quite simply. Figure 2-10 
shows a surfel located at point P, with normal n r I mages with camera centers 
in the hashed half space cannot possibly have viewed P,-. n, is not known a 
priori, but thefact that P^ is visible in n* limits its possible values. This range 
of values can then be sampled and used to eliminate occluded views from lAj). 
Let a bean estimate of the normal 5 n, and C~p] be the unit vector along the ray 

5 Actually, a is the azimuth and the elevation is assumed to beO. This simplifies the presen¬ 
tation without loss of generality. 
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from C i toPj, then can only be visible in n? if QP ]-a < 0. We denote the set of 
image points with meet this visibility constraint asp“. At a fundamental level 
an estimated normal is associated with each reconstructed three-dimensional 
point (or surfel). For single pixels, the estimated normal is very coarse, but its 
existence is very useful for grouping individual points (or surfels) into surfaces. 
Neighborhood information, introduced in Chapter 3 as small planar patches, 
greatly improves the accuracy of the estimated normals. 

If the volume imaged by n ; is modeled (perhaps incompletely) by previous 
reconstructions, then this information can be used to improve the current re¬ 
construction. Views for which the depth 6 atp*, or d(p*), is I ess than the distance 
from C i to P j can also be eliminated. For example, if M 1 and M 2 have already 
been reconstructed, then the contributions of n? where % e {1,2,3} can be elim¬ 
inated from u{j) (Figure 2-9). In addition, we weight the contributions based 
on their for shortening. The updated function becomes: 


«) 


E(cT'“) x(Hvi).Hv')) 

i&Q 


EC.P, ■ a 

i&Q 


(2.3) 


where 



f 

pj- e n* ) 

Q = < 

r 

p ) e P“ 

d(pj-) > ||Ci - Pj|| 2 J 


Then, if sufficient evidence exists, 


argmax(z/(j, a)) 


j = depth at p* 
a = esti mate of rp 


2.4 Probabilistic Formulation 

Probabilities can be used to formulate the notion that v(j) and u(j,a ) should 
be large when P^ lies on a physical surface and small otherwise. Given image 
data I, q the probability that Pj lies on a physical surface, has depth j and 
orientation a, and is visible in n* can be written formally as 

q = p(d(p*) = j,o(p*) = a 1 1 ). (2.4) 


Using Bayes' rule gives 

_ p {I I d(p*) =j, o(p*) = a)p(d(p*) =j, o(p*) = a) 

“ Pi i) 


6 Distance from C, to the closest previously reconstructed object or point in the direction 
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Of all the image points in I only p" depend upon a and j. The rest can be 
factored out yielding 

p(Pj I d(p*) = j, o(p*) = a)p(d(p*) = j, o(p*) = a) 

Q = --- l 2 ' 5 ' 

P{Pj) 

log (q) = log(p(p“ | d(p*) = j,o(p*) = a)) + (2.6) 

l°g(p(d(p*) = J, o(p*) = a)) - log(p(p“)) 

If we assume that the measured pixel values ^(p*) contain Gaussian noise 
with a mean value of ^(p*) and a variance of o 2 and that individual pixel mea¬ 
surements are independent then the first term becomes 

~ E ((ep;) - +log (a™ 2 )) (2.7) 

which is very similar to u(j,a) (Equation 2.3). The next term of Equation 2.6 
is a prior on the distribution of depths and orientations. If a partial model 
already exists, it can be used to estimate these distributions. Otherwise we 
make the standard assumption that all j's and a's areequi-probable. 

The last term is more challenging; how do we estimate p(p")? Applying the 
same i ndependence assu mpti on as above yi el ds 

p{ Pj) = 5D lo g(p(p}))- 


Wecould assume that all values of T'(-) areequi-probable in all images, making 
the denominator irrelevant, and end up with a matching function equivalent 
to Equation 2.3. Another approach is to use nf to estimatep(p*). p{ p*) can be 
thought of as a significance term. The significance of a match between F($j) 
and ^(p*) is inversely proportional top(p}). We use a nearby (in both position 
and orientation) imagenf instead of nj toestimatep(p*). p*, C if and C k define 
an epipolar plane nf. To estimate the distribution we consider epipolar line 
if- k (projection of nf onto image plane nf). £ khk contains all of the possible 
matches for p* in nf and we use this set of possible matches and a mixture of 
Gaussians model toestimatep(pj) giving: 


p{ pJ) 


E G(7 r (p*),(T 2 ,7 r (p)) 


_ /)ki,k 

pe4 


E i 


p£i, 


ki,k 


( 2 . 8 ) 


Since p(p*) depends only upon p* and nf, it can be computed once as a pre¬ 
processing step. Figure 2-11 shows two probability images produced in this 
manner. Black represents low probability and white high. p( pf) can bethought 
of as a uniqueness term - p* matching p* is only significant if p(p*) is low. 
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Figure 2-11: Probability images for synthetic data. 

Our approach to estimating p(d(p*) = j,o(p*) = a |I) is conservative. It does 
not consider subsets of p" or allow individual el ements to be tagged as occluded 
or outliers. Nevertheless valuable insight was gained from examiningp(j|). In 
the next section results both with and without consideringp(p*) are presented. 


2.5 Results 



Figure 2-12: Example renderings of the model. 

Syntheti c i magery was used to expl ore the characteri sti cs of v(j) and u(j, a). 
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A CAD model of Technology Square, thefour-building complex housing our lab¬ 
oratory, was built by hand. The locations and geometries of the buildings were 
determined using traditional survey techniques. Photographs of the buildings 
were used to extract texture maps which were matched with the survey data. 
This three-dimensional model was then rendered from 100 positions along a 
"walk around the block” (Figure 2-12). From this set of images, a If and p* 
were chosen and an epi polar imaged constructed. E was then analyzed using 
two function, Equations 2.2 and 2.3where 

T{x) = rgb(x) = [r(x), g(x), b(x)] (2.9) 

and 

X([ri,g 1 ,b 1 \,[r 2 ,g 2 ,b 2 }) = (2.10) 

( (n - r 2 ) 2 + (gi - y 2 ) 2 + (h -b 2 ) 2 \ 

V a r a b )' 

r(x), g(x), and b(x) are the red, green, and blue pixel values at point x. <%, 
and a l are the variances in the red, green, and blue channels respectively. For 
thesynthetic imagery used in this chapter we set a? = o\ = al = 1. Elsewhere, 
the variances are estimated during the internal camera calibration and are 
assumed to remain constant. 

Figures 2-13 and 2-14 show a base image n* with p* marked by a cross. 
Under n* is the epi polar image £ generated using the remaining 99 images. 
Below £ is the matching function u(j) (Equation 2.2) and u(j,a) (Equation 2.3). 
The horizontal scale, j or depth, is the same for £, u(j) and v(j,a). The vertical 
axi s of £ i s the i mage i ndex, and of v(j, a) i s a coarse esti mate of the ori entati on 
a at Pj. The vertical axis of lAj) has no significance; it is a single row that has 
been replicated to increase visibility. To the right, u(j) and u(j,a) are also 
shown as two-dimensional plots 7 with thecorrect depth 8 marked by a line. 

Figure 2-13a shows the epi polar image that results when the upper left- 
hand corner of the foreground building is chosen as p*. Near the bottom of £, 
is close to horizontal, and p* is the projection of blue sky everywhere except 
at the building corner. The corner points show up in £ near the right side as 
a vertical streak. This is as expected since the construction of £ places the 
projections of in the same column. Near the middle of £, the long side to 
side streaks result because P, is occluded, and near the top the large black 
region is produced because p* ^ nf. Both u(j) and u(j,a) have a sharp peak 9 
that corresponds to the vertical stack of corner points. This peak occurs at a 

’Actually, is plotted for v(j,a). 

determined by intersecting £* with the model. 

9 White indicates the best match, black the worst. 
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Figure 2-13: n*, p*, S, v(j) and v(j,a) (Part I). 
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n* and p* 
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Figure 2-14: n*, p*, £, v(j) and u(j,a) (Part II). 
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depth of 2375 units 10 (J = 321) for u(j) and a depth of 2385 {j = 322) for u(j, a). 
The actual distance to the corner is 2387.4 units. The reconstructed world 
coordinates of p* are [-1441,-3084,1830] and [-1438,-3077,1837] respectively. 
The actual coordinates 11 are [-1446, -3078,1846]. 

Figure 2-13b shows the epipolar image that results when a point just on 
the dark side of the front left edge of the building is chosen asp*. Again both 
v(j) and u(j, a) have a single peak that agrees well with the actual depth. This 
time, however, the peaks are asymmetric and have much broader tails. This is 
caused by the high contrast between the bright and dark faces of the building 
and the lack of contrast within the dark face. The peak in u(j,a) is slightly 
better than the one in u(j). 

Figure2-13c shows the epi polar image that results when a point just on the 
bright side of the front left edge of the building is chosen as p*. This time u(j) 
and u(j,a) are substantially different. v{j) no longer has a single peak. The 
largest peak occurs at j = 370 and the next largest at j = 297. The peak at 
j = 297 agrees with the actual depth. The peak at j = 370 corresponds to the 
point where t exits the back side of the building. Remember that iAj) does 
not impose the visibility constraint shown in Figure 2-10. v(j,a), on the other 
hand, still has a single peak, clearly indicating the usefulness of estimating a. 

In Figure 2-14a, p* is a point from the interior of a building face. There 
is a clear peak in u(j,a) that agrees well with the actual depth and is better 
than that in u(j). In Figure 2-14b, p* is a point on a building face that is 
occluded (Figure 2-9) in a number of views. Both u(j) and v(j,a) produce fairly 
good peaks that agree with the actual depth. In Figure 2-14c, p* is a point 
on a building face with very low contrast. In this case, neither u(j) nor u(j,a) 
provide clear evidence about the correct depth. The actual depth occurs at 
j = 387.5. Both iAj) and u(j, a) lack sharp peaks in large regions with little or 
no contrast or excessive occlusion. Choosing p* as a sky or ground pixel will 
produce a nearly constant v(j) or u(j, a). 

Figures 2-15 and 2-16 show the result of running the algorithm on synthetic 
data. A grey circle and a black line are used mark the location and orientation 
of each image in the dataset. The a; and y coordinates have been divided by 
1000 to simplify the plots. For each pixel in the set of images shown in Figure 
2-12 an epi polar image was constructed and the point with maximum u(j,a) 
recorded. Points for which u(j, a) is at least -2 and |p ; | is at least 5 are plotted. 
Global constraints such as ordering or smoothness were not imposed, and no 
attempt was made to remove low confidence depths or otherwise post-process 
the maximum of u(j, a). I n Figure 2-15 the grey levels correspond to the density 
of reconstructed points, with black the most dense. In Figure 2-16 the grey 

10 1 unit is 1/10 of a foot. 

n Some of the difference may be due to the fact that p* was chosen by hand and might not be 
the exact projection of the corner. 
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Figure 2-15: Density of reconstructed points. 
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Figure 2-16: Orientation of reconstructed points. 
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Figure 2-17: Distribution of errors for reconstruction. 
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Figure 2-18: Two views of the reconstruction. 
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levels correspond to orientation. The grey scale axis to the right is in units of 
7r/2. The outlines of thefour buildings are clearly visible and the orientations 
are roughly correct. 

What is difficult to appreciate from these plots is the relative numbers of 
points contributing to the black lines compared to the grey outliers scattered 
about. Figure 2-17 shows the error distribution for the reconstruction. Plotted 
values indicate the percentage of reconstructed points which havenomorethan 
the indicated error. The error measure is 

||Pj (reconstructed) — Pj (actual) || / ||C* — Pj (actual) || . 

Note that > 85% of the reconstructed points are within 1 % of the correct value. 
Figure 2-18 shows the lower left corner of the bottom building as viewed from 
[-1600,-4000,1100] and [-2000,-3500,1200]. The results are rendered as ori¬ 
ented rectangular surfaces using the reconstructed position and estimated ori¬ 
entation. The size is set so that the projection in n* is 1 pixel. For clarity, 
the reconstructed points have been down-sampled by 3 in each direction. The 
recovered orientations may not be good enough to directly estimate the under¬ 
lying surface, but they are good enough to distinguish between surfaces. This 
idea will play an important role in Chapter 5. 

Next we explore several modifications to the basic algorithm described in 
Section 2.3. The synthetic images in Figure 2-12 were rendered with direc¬ 
tional lighting in addition to global lighting. As a result, image brightness 
varies with viewing direction. To compensate we add a brightness correction 
term 7 to the matching function 12 : 


*([7-1,01,61], [7-2,02,62]) 


7 


'(7n-r 2 ) 2 ( 70 i ~ 92 f (761-62) 

o r o "t" o 


erf 


err 


erf 



We also consider two variations of the probabilistic formulation developed in 
Section 2.4. In one we limit log(p(p“)) > -15 and the other we do not. The 
upper left image shown in Figure 2-12 was selected as If. 39% or 67,383 pix¬ 
els are reconstructble (i.e. non-sky and non-ground). Each pixel in If was 
reconstructed using the six variants shown in Table 2.2. Variants which in¬ 
clude brightness compensation are annotated with "w/ be” and those without 
"w/o be”. Variants which use the probabilistic formulation are annotated "+P” 
and thosethat limit the log probability have ">=^15” added. Figure 2-19 shows 
error distributions for the variants. Interestingly, variant 6 has the best error 


12 This is similar to normalized correlation, however we impose nonlinear constraints on 7. 



Points (°/t) Points (%) 
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Figure2-19: Error distribution for variants. 



Figure 2-20: Error distribution after adding noise. 
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distribution, but misses many reconstructive points. Conversely, variant 1 re¬ 
constructs nearly all of the points, but sacrifices some accuracy. Variant 3 is 
the overall best. 


Variant 

Label 

Points Reconstructed 

Number 

Percent 

1 

M w/ be 

65,582 

97 

2 

M+P w/be 

63,579 

94 

3 

M+P>=-15 w/ be 

62,564 

93 

4 

M w/obc 

34,710 

52 

5 

M +P w/o be 

24,281 

36 

6 

M+P>=-15 w/o be 

23,485 

34 


Table 2.2: Performance of algorithm variants. 

Finally, we consider the effects of camera calibration noise. Ultimately it is 
hoped that Argus will achieve positional accuracy on theorder of a few centime¬ 
ters and pose accuracy on the order of a few milliradians. Argus has not yet 
achieved this design goal, but we use it as a reference. Uniformly distributed 
noise was added to each coordinate of the position and to the orientation in 
increments of 2” and 0.1° respectively. The results are shown in Figure 2-20. 


2.6 Discussion 

This chapter defines a construct called an epipolar image and then uses it to 
analyze large sets of synthetic data. This analysis produces an evidence versus 
position and surface normal distribution that in many cases contains a clear 
and distinct global maximum. The location of this peak determines the po¬ 
sition. The algorithm presented uses only local calculations and lends itself 
nicely to parallelization. 




Chapter 3 

The Challenge of Noisy Data 


This chapter extends the approach presented in the previous one to work with 
noisy data. Specifically data which contains 1) camera calibration error; 2) 
significant variations in illumination; and, 3) complex occlusion (e.g. building 
viewed through tree branches) and image noise. With these modifications, our 
approach shifts from looking for sets of matching points to searching for sets 
of highly correlated regions. The fact that at a fundemental level we match 
surfels (small planar regions with position, orientation, and texture) is made 
explicit in this chapter. Theaddition of neighborhood information produces sev¬ 
eral benefits, such as accurate normal estimation, but also makes the epipolar 
image construct £, described in Chapter 2, less useful. £ can no longer be used 
to directly visualize possible correspondences. Hypothesizing points along t 
also becomes less attractive. This point wil be addressed furter in Chapter 4. 
Some additional notation used in this chapter is defined in Table 3.1. 


3.1 Camera Calibration Errors 

Theepipolar constraint exploited in Chapter 2 relies on accurate camera cali¬ 
bration. If the calibration is not accurate then the constraint that correspond¬ 
ing points must lie on a pair of epipolar lines (e.g. 4 and ^l in Figure 2-1) no 
longer holds. If the calibration error can be bounded then the epipolar lines 
become epipolar stripes. Consider perturbing Q and C 2 in Figure 3-1 about 
their centers of projection; the intersection of IT with nl and nf sweeps out 
t\ and l\. The area of the epipolar stripe is proportional to the calibration 
error. What was a one-dimensional search with perfect calibration is now a 
two-dimensional search. As shown in Figure 3-2, the actual projection of P is 
not p. If the bounded camera calibration error 5 is known thenf) can be calcu¬ 
lated and pej). The effect can be modeled as a displacement [u, v] of the actual 
projection. 

Calibration error complicates the epipolar image £ and its analysis de¬ 
scribed in Sections 2.2 and 2.3. p, does not contain all of the information in 
U\ about Pt we must consider P^. I nstead of an array (indexed by j and a) of 
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yj 
x j 

*i 

9 

n* 

r 

p;- 

pj 

p} 

p 

p 

« 

p 


j 


j 


n fc 

^T e 

pk,i 

? 

T(P,i,5) 


surfel. 

Point \nSj. x and y are indices intoS,-. gS^- = Pj. 
Projection of Sj onto ftp 
Set of projections of S J( |s*|. 

Point in s*. x and y are indices intos*. [js* = p*. 
Basesurfel. fe*}. May contain estimated values. 

M atching set of surfels. |s* s* matches s* }. 

Estimated center of projection for thei th camera. 
Estimated image plane for theP h camera. 

Cal i brated i mage (esti mated cal i bration). P = (n?, Q). 
I mage poi nt. P roj ecti on of P j onto n?. 

Set of projections of P Jf |pU. 

Noisy Projection of P^ onto ftp Area that 
P j projects to given bounded camera calibration error. 
Point in . u and v are indices into. 

Pp* where h* and ii produce the best match. 

Set of matching image points, jp* p* matches p*}. 
Reconstructed world point. Estimated usingp,. 

Set of noisy projections of P jt jf>‘ }. 

The/c th estimated epipolar plane. 

E pi polar stripe. Noisy projection of n£ onto ftp 
Bounded camera calibration error for P 
Noisy projection function, e.g. = f>*. 

Note: T(Pj,P,^) =p*. 


Table 3.1: Notation used for handling noisy data. 


individual pixels, £ is now composed of two-dimensional regions. Equations 2.2 
and 2.3 become 


”ti) 


E~ maxT’(y r (jjp}),^(p*)) 
i |pjen? 


E~ 1 

i |pjen? 


(3.1) 


and 

(C.P, ■ a) max |7j, J-{ p' I ) 

v(],a) = — -—^- (3.2) 

HC.P, • o 

i&Q 
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where 


[, 

p* e n* 
Pi e P“ 


1 

2 1 

i 

d(pj) > 

C, - P, 

J 


Equation 2.8 must also be modified. E pi polar stripe^ is used toestimatethe 
distribution instead of epipolar lineproducing 


Pi P$) 


E G(^(p*),<r 2 ,^(p)) 

efe i,fc 

_ Zki.k 

pe4 


(3.3) 


I n the last chapter, if a match was found the reconstructed point was auto¬ 
matically P j. This is no longer the case. p ; represents our best estimate for a 
set of corresponding points and iu and v t are shifts which correct for the cali¬ 
bration error. What world point P, gave risetopj? It almost certainly wasn't 
P j. Assuming that the calibration error can be modeled as zero mean additive 
Gaussian noise then the best estimate of P, is the one which minimizes the 
sum of squared calibration errors. Two possible measures of the calibration 
error are the distance (in three-dimensional space) between P,- and the line of 
sight through p*, 

argmin V Qp} x ((P j - Q) x C, ; p*) • (Pj - Q) (3.4) 

' p 


and the distance (in two-dimensional space) between p* and the projection of 

P •, T(P -,P) 

argmin ^ |p* - T(P i ,F)| 2 . (3.5) 

3 pep., 

The shifts {[?ij,nj]} introduce additional degrees of freedom. One benefit of this 
is that P j can be reconstructed from more than oneP^, allowing P, to be recov¬ 
ered even if P^ is not very close. This property will be exploited in Chapter 4. 
The additional degrees of freedom also admit additional false positives. Figure 
3-3 shows one example which occurs because similar looking points (the upper 
left corner of a black square) lie within each jP, (the grey circles). One way 
tolimit the number of false positives is to constrain theshifts. All of the shifts 
applied to points in a given image should be consistent with a single camera 
correction. This will be explored further in Chapter 5. 


3.2 Variations in Illumination and Reflectance 

The matching function proposed in Equation 2.10 assumes controlled lighting 
and diffuse reflectance. Only the overall brightness is allowed to change. Nat- 



3.2. VARIATIONS IN ILLUMINATION AND REFLECTANCE 


53 





Figure 3-3: False positive induced by compensating for camera calibration er¬ 
ror with shifts. 

ural lighting can vary significantly (e.g. time of day, season, weather, etc.) not 
only in brightness, but also chromatidty. Real scenes are also not restricted to 
objects which reflect diffusely. Both of these make it difficult to compare im¬ 
ages of the same region taken under differing illumination conditions and from 
different viewpoints. 

A simple linear model can be used to correct images taken under different 
viewing conditions (illumination and/or reflectance). The radiance R arriving 
at an image sensor from a given region is simply the irradiance/ at the region 
multiplied by its reflectance k\ 

R = hi. (3.6) 

and the value measured by the image sensor is: 

V = f (R) + d. (3.7) 

f(-), an invertible function, and b, an offset, are functions of the image sen¬ 
sor 1 . Figure 3-4 shows a region imaged under different conditions. Clearly, if 
k 2 I 2 /hh is constant for some region then R 2 /R 1 will also be constant over the 
same region. From Equations 3.6 and 3.7 it follows that if for some region 

I 2 k 2 = Rkic (3.8) 

Equation 3.7 is commonly cal led the image sensor's intensity response fundi on [Vora eta/., 
1997a, Vora eta/., 1997b], 
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zx 



Figure 3-4: Region imaged under different conditions. 


and 

f (xy) = f (®)f(y) 


then 


V 2 = f(c)l»i - f(c)di + d 2 


(3.9) 


or 


v 2 = mv\ + d. 


(3.10) 


Equation 3.9 is true for typical CCD cameras. Equation 3.10 applies to both 
diffuse and non-diffuse surfaces and can be calculated independently for each 
color channel. We often refer tom and d as an illumination correction. It 
allows us to remap image values obtained under one condition to another and 
then match them. Substituting Equation 3.10 into Equation 2.10 gives: 


X([r 1 ,g 1 ,b 1 ],[r 2 ,g 2 ,b 2 ]) = (3.11) 

/((m r ri + d T ) - r 2 f ((m g ^i + d s ) - g 2 f {{m h b\ + d s ) - b 2 ) 2 \ 

V )' 

The condition given by Equation 3.8 requires that the product of irradiance 
and reflectance vary similarly for changes in viewing conditions at each point 
x in the region 1Z\ 

YxeTZ: I 2 r 2 //in = c. (3.12) 

In practical terms, this means that 7 Z must be planar (or well approximated 
by a plane), uniformly lit, and contain materials with similar reflectance func¬ 
tions. If we are willing to assume that only a limited number of materials 
appear in each region, then using clustering techniques to calculate the cor¬ 
rection removes the requirement that the reflectance functions be similar. For 
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example, in the next chapter, a region containing a window with specular re¬ 
flection and concrete is successfully corrected with this technique. The spectral 
content of the irradiance may be changed arbitrarily from one condition to an¬ 
other and the region may contain different colored materials. If Equation 3.9 
is not valid and the intensity response function is known, then image values 
can be converted to radiance and the remapping done there. 

A single pixel was sufficient to calculate v(-) as presented in Sections 2.3 
and 3.1. This is no longer the case once we consider illumination. I n order to 
calculate m and b at least 2 pixels are needed and for good results more should 
be used. Consider matching images of a surfel (small planar region) S, with 
orientation n, located at Pj. Since orientation is intrinsic to a surfel, Equation 
3.1 is no longer useful. We can rewrite Equation 3.2 as 


u (Sj) 


i&Q 


c* p i • 11 / 


max 

u.v 


■ >v 

i&Q 



(3.13) 


In order toobtain a match, S j must beclose(both in position and orientation)to 
a physical surface and Equation 3.12 must be satisfied. Correcting for viewing 
conditions allows us to match images taken under different illumination con¬ 
ditions and from different viewpoints, but also admits false positives. Figure 
3-5 shows one example. One way to limit the number of false positives is to 
constrain the correction (mr,d T ,m g ,d g ,m h ,d h ). This is explored in Section 4.2. 


There are several drawbacks to matching surfels instead of individual pix¬ 
els: the cost of computing the match is higher; the shifts introduced in Section 
3.1 vary with image position and the distance to the imaged point and there¬ 
fore are not constant throughout the surfel; and, calculating p(p*) is problem¬ 
atic. For small calibration errors and small regions it is reasonable to assume 
that the shifts are piecewise constant and because of the additional informa¬ 
tion content of a surfel, p(pj) is less important. On the positive side, the image 
data of a surfel can be used to efficiently calculate d and v % and accurately 
estimate its normal (Sections 4.2 and 4.3). 


3.3 Occlusion and I mage Noise 

Occlusion poses another difficult problem. Figure 3-6 shows two examples typ¬ 
ical of urban environments. On the left, the foreground building completely 
occludes two columns of windows. This view provides no information about 
the occluded region and if enough other views are completely occluded we will 



56 


CHAPTER 3. THE CHALLENGE OF NOISY DATA 



Figure 3-5: False positive induced by correcting for viewing conditions. 




Figure 3-6: Flard (left) and soft (right) occlusion. 
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not be able to reconstruct that portion of the background building. The exam¬ 
ple on the right is potentially more difficult. The tree only partially occludes 
the building. The occluded region contains a significant amount of information 
which we would like to use in the reconstruction process. If we are matching 
individual pixels this is simple: either a pixel is occluded and does not match 
well or it is not and matches. On theother hand, multi-pixel surfels can contain 
both occluded and unoccluded pixels. 


— 


■ 

1 

s* 

h J 


■ 

* 

Mask 


Match 



■ 

I 

S z 

□ 

Match 


Figure3-7: Masks. 

More generally, either a pixel contains data which is representative of a 
surface that we intend to model, or it is an outlier. Thetree pixels in Figure 3-6 
are outliers. Other examples are transient objects (cars, people, etc), imaging 
aberrations (sensor noise, blooming), and specular reflections. To account for 
this we use a mask and allow individual pixels to be tagged as outliers and 
not count towards the match score. Deciding which are outliers and which are 
not is the hard part. One possible approach, shown on the left side of Figure 
3-7, is to tag every pixel which matches poorly as an outlier. Black indicates 
pixels tagged as outliers and poor matches. Clearly the pair (s), sj) should not 
be considered a match. Less than 20% of the pixels are tagged yet a perfect 
match is obtained. This approach admits too many false positives, particularly 
in conjunction with shifts and illumination corrections. A more conservative 
approach is shown on the right side of Figure 3-7. A pixel is tagged as an 
outlier only if the match is equally poor throughout a small region (the eight 
nearest neighbors) around the corresponding pixel. This assumes that thetwo 
regions have already been aligned. Notice that the 1 pixel that which is truly 
an outlier is tagged as such and fewer pixels are inappropriately tagged as 
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outliers. I n some cases, the number of outlier pixels is so high that the entire 
region should be tagged as an outlier. Equation 3.13 can berewritten to account 
for both individual pixel and entire region outliers as follows: 


v(S i) = 


ieS v 


/ 



max 

U,V 


EW(! S, 


i&Q 


(3.14) 



where 



p j- e n \ 




p 5 e P? 


o 


d(pj) > 

Q - Py 

z 

i 

s* 7^ out 

ier 

/ 



%Sj e S, 

) / 'El >0.5 

< 



x 3 3 


and M\ y x Sj e Sj) returns 0 if the pixel has been tagged as an outlier and 1 
otherwise. 

Equations 3.1, 3.2, 3.13, and 3.15 all match against a distinguished element 
p* or more generally s*. This assumes that s* is representative of the actual 
appearance of the underlying world surface. In the presence of occlusion and 
image noise this is not always the case. Outliers in s* are not likely to match 
the corresponding data in any of the other images resulting in the mismatch 
penalty being applied im times. One way to mitigate this effect is to first 
estimates* from the image data and then perform the matching. With camera 
calibration error and variations in the viewing condition, as well as outliers in 
the image data, this is a difficult task. Instead we successively set s* = s*. We 
can exhaustively test the sj's or if we are willing to occasionally miss a match 
we can simply try a few of the best. 


3.4 Discussion 

This chapter introduces several powerful techniques for dealing with data that 
contains 1) camera calibration error; 2) significant variations in illumination; 
and, 3) difficult occlusion and image noise. As pointed out above, over-fitting 
is a concern. The next chapter applies these techniques directly to a large 
dataset. And the following chapter imposes several geometric constraints to 
prune false positives. 



C hapter 4 

From I mages to Surfels 


Chapter 3 discussed several methods to compensate for noisy data. This chap¬ 
ter will explore these methods in practice. We will focus on two char act eristics: 
the ability to detect nearby (in both position and orientation) surfaces and once 
detected, /oca//'zetheir actual position and orientation. 


4.1 The Dataset 

A Kodak DCS 420 digital camera mounted on an instrumented platform was 
used to acquire a set of calibrated images in and around Technology Square 
(the same office complex depicted in the synthetic imagery and Figure 1-4) 
[De Couto, 1998]. Nearly 4000 images were collected from 81 node points. 
Other than avoiding inclement weather and darkness, no restrictions were 
placed on the day and time of, or weather conditions during, acquisition. The 
location of each node is shown in figure 4-1. At each node, the camera was ro¬ 
tated about thefocal point collecting images in a hemi-spherical mosaic (Figure 
1-6). Most nodes are tiled with 47 images. The raw images are 1524 x 1012 pix¬ 
els and cover a field of view of 41° x 28°. Each node contains approximately 70 
million pixels. After acquisition, the images are reduced to quarter resolution 1 
and mosaiced [Coorg eta/., 1998, Coorg, 1998]. Equal area projections of the 
spherical mosaic from two nodes are shown in Figure 4-2. The top node was 
acquired on an overcast day and has a distinct reddish tint. The bottom image 
was acquired on a bright clear day. Significant shadows are present in the bot¬ 
tom image whereas the top has fairly uniform lighting. Following mosaicing, 
the estimated camera calibrations are refined. 

After refinement, the calibration data is good, but not perfect. The pose 
estimates are within about 1° and 1 meter of the actual values 2 . As described 
in Section 3.1, these errors produce an offset between corresponding points in 
different images. A 1° pose error will displace a feature by over 8 pixels. Our 

^l X 253 pixels. 

2 An interactive tool was used to manually identify a number of correspondences which were 
then used to bound the camera calibration error. 
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Q 

Figure 4-1: Node locations. 

calibration estimates are in an absolute coordinate frame which allows us to 
integrate images regardless of when or from what source they were collected. 
This greatly i ncr eases the quantity and qual ity of avai I able data, but because of 
variations in illumination condition (Section 3.2) also complicates theanalysis. 

Figure4-3and 4-4 show several images from our dataset reprojecterf ontoa 
surfel which is coincident with an actual surface. The location, orientation, and 
size of the surfels used are shown in Table 4.1. Surfel 1 was used to generate 
Figure 4-3 and surfel 2 Figure 4-4. If the camera calibration estimates were 
perfect and the illumination was constant, the regions in each figure should be 
identical 3 4 . The misalignment present in both sets is the result of error in the 
calibration estimates. Figure4-3 is representative of the best in the dataset. A 
large number of the source images have high contrast and none of the regions 
are occluded. The third row has a distinct reddish tint. The four images in 
the center of the last row were collected under direct sunlight. And, the last 
two images were taken near dusk. Figure 4-4 is more typical of the dataset. 
It is lower in contrast and some of the regions are partially occluded by trees. 


3 Using the estimated camera calibration. 

ignoring errors introduced during image formation (discretizing into pixels, etc) and re¬ 
sampling. 
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Figure 4-2: Example nodes. 


Surfel 

Position 

Normal 

Size (units) 

Size (pixels) 

X 

y 

z 

X 

y 

z 

X 

y 

X 

y 

1 

-1445 

-2600 

1200 

-1 

0 

0 

110 

110 

11 

11 

2 

-280 

-3078 

1500 

0 

-1 

0 

110 

110 

11 

11 


Table 4.1: Surfel parameters. 
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Figure 4-3: Reprojection ontosurfel 1 coincident with actual surface. 



Figure4-4: Reprojection ontosurfel 2 coincident with actual surface. 
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Figure4-5: Source images for selected regions of surf el 1. 
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Figure 4-6: Source images for selected regions of surf el 2. 
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Figures 4-5 and 4-6 show source images with the reprojected area marked by a 
circle for several of the regions shown in Figure 4-3 and 4-4. I n the worst cases 
all views of a surfel are similar to the upper left image in Figure 4-6. 


4.2 Detecting Surfels 

This section focuses on using i/(S/ to detect nearby (both in position and ori¬ 
entation) surfaces. The following algorithm lists the major steps in our imple¬ 
mentation. 

1. FIypothesizesurfel Sj in world coordinates. 

2. Select images!. 

3. Reproject regionss,. 

4. Select next s*. 

5. For each region s/ 

(a) Determine best shift («*,6/. 

(b) Estimate col or correction (mr,d r ,m g ,d g ,m b ,db)- 

(c) Calculate best match. £ X(?(£*$),TQfl)) / £ 1 

x S , 6 S j / 'I S 'y (xS j 

6. Evaluate match set u( S/: 

• If good enough =>■ done. 

• If not => goto4. 

In Chapter 2 P/s were determined by sampling along t. I n essence, the im¬ 
age data directly determined the test points. 500 j's and 25 a's were evaluated 
for each p*. Given the dataset described above, this would require evaluating 
~ 5 x 10 12 points. Since our calibration estimates are oriented in an absolute 
coordinate system, a more efficient approach would be to choose test points in 
world coordinates. Sampling the volume of interest (from ground to 2000 units 
and containing all nodes) every 100 units in position and at 8 different orien¬ 
tations would test less than 2 x 10 6 points - more than six orders of magnitude 
fewer. 

Once a surfel (such as shown in Table 4.1) is selected, we constructl. Only 
cameras which image the front side of the surfel and are not too close or too 
far are considered. We use 120 units as the minimum distance and 6000 as 
the maximum. In addition, for most of the results presented in this thesis we 
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limit |I| to 20 images. If necessary, |I| is reduced by choosing the subset which 

maximizes the variation in viewpoint. We use the projection of S^Q onto the 
plane containing the surfel as a measure of the viewpoint. The subset which 
maximizes the minimum distance between the projections also maximizes the 
viewpoint variation. 

Each image in I is then reprojected on the the surfel, producing sets of 
regions Sj similar to those shown in Figures 4-3 and 4-4. Tofacilitate choosing 
s*, we define a region's interest as 


interest = 


C,P, • a 


u.v Vu 2 + V 2 £ 1 

V |s,-es j ) 



(4.1) 


where 


K v) e {(-i, -i), (o, -l), (l, -l), (-1, o), (l, o), (-1, l), (o, l), (l, i)} 

and then choose the largest one first. This gives priority to regions with in¬ 
teresting textures (i.e. ones which would produce a significant match), better 
contrast, and unforeshortened views. Only regions with a minimum interest 
(we typically use 50 as the threshold) need even be considered. In fact, at most 
we try the top fives*'s. Region 5 (top row, fifth from the left) is the most inter¬ 
esting in Figure 4-3 with a score of 575.5. Region 2 is the most interesting in 
Figure 4-4 with a score of 34.5 5 . 

Next we consider finding the best match between s* and s*. We do this by 
evaluating: 

max e(u, v) (4.2) 

u,v 

where 

0= £ X(F(g$),?(*$)) / £ 1. (4.3) 

Steps 5a, 5b, and 5c are actually accomplished simultaneously when finding 
the best match but for our discussion we will consider the steps separately. 

Figures 4-7 and 4-8 show -e versus u, x shift, and v, y shift, for three regions 
each from Figures 4-3 and 4-4. The left column shows the error near the cor¬ 
rect shift plotted as a three-dimensional surface. Thecenter and right columns 
show the projection onto the x-error plane and y-error plane respectively. Re¬ 
gion five from Figure4-3 is used ass* for all of the plots in Figure4-7. The plots 
are periodic because of the repetitive texture (windows on the on the building). 

5 For this example we have lowered the threshold to 25 otherwise this region would not be 
considered. 
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Figure 4-7: Shift error space near correct shift (left), projection onto the x- 
error plane (center), and y-e rror plane (right). The periodicity is caused by 
repetitive texture (windows) on the buildings. The change of periodicity in the 
x and y directions is produced by the window spacing in the horizontal and 
vertical directions. The change in periodicity from row to row is the result of 
foreshortening. 
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Figure 4-8: Shift error space near correct shift (left), projection onto the x- 
error plane (center), and y-e rror plane (right). The periodicity is caused by 
repetitive texture (windows) on the buildings. The change of periodicity in the 
x and y directions is produced by the window spacing in the horizontal and 
vertical directions. The change in periodicity from row to row is the result of 
foreshortening. 
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The different periodicity in the a; and y directions reflects the horizontal and 
vertical spacing of the windows. The first, ninth, and last regions of Figure 4-3 
were used as s* for the first, second, and third rows of Figure 4-7 respectively. 
Region two from Figure 4-4 is used as s* for all of the plots in Figure 4-8. The 
first, fourth, and eleventh regions of Figure 4-4 were used as s) for the first, 
second, and third rows of Figure 4-8 respectively. 

The s*-'s used in the first and third rows of Figure 4-7 and the second row 
of Figure 4-8 originate from cameras which imaged thesurfel froma relatively 
oblique viewpoint. Thes* for the third line of Figure 4-7 is a very low contrast 
image acquired at dusk and thes* for thefirst line of Figure 4-8 contains signifi¬ 
cant soft occlusion from tree branches. Notethat a shift is required in each case 
to obtain the best match. All of the plots are smooth with well defined minima 
making Equation 4.2 a good candidatefor optimization techniqueslPress etal., 
1992]. We have tried a number techniques including direction set methods (e.g. 
Powell's method) and conjugate gradient methods? 5 and all worked well. Mul¬ 
tiple minima, however, area concern. We consider only the nearest minimum 
and use S l to limit the range of valid shifts. For the results presented in this 
thesis we limited u and v to ±5. I n many cases, only a single minimum exists 
within this limited range. The multiple minima, particularly those in Figure 
4-7, can lead to false positives I ike the one illustrated in Figure 3-3. 

The color correction is determined using linear regression [Press et a!., 
1992] and limiting the correction, s* is used as the x data and s* as the y. 
Each color channel (r, g, and b) is computed separately. Changes in both illu¬ 
mination and reflectance are factored into equation 3.10, however we assume 
that changes in illumination dominate the correction. Given this assumption, 
m r , m g , and m h are constrained to be positive. I mages collected under a clear 
sky tend to have a blue tint and those collected under cloudy skies have a red¬ 
dish tint. While changes in the spectral composition of natural lighting are 
significant they are limited. Consider the vector [rrk,m g ,m h \ m , If brightness is 
theonly change between s* and s*, the vector will be in the direction [1,1,1]. We 
use the angle between [m r ,m g ,m h \ and [1,1,1] as a measure of the variation in 
spectral composition and limit it to 10°. Further, we limit the overall change in 
brightness for any channel (m) to a maximum of 10 and a minimum of 1/10. If 
the best correction exceeds any of these limits, the match is not allowed. 

Figures 4-9 and 4-10 show the regions from Figures 4-3 and 4-4 with the 
shifts and color corrections applied 6 7 . One e(u h Vi) and (m r ,d T ,m g ,d g ,m h ,d h ) have 
been determined calculating the match score is a straight forward evaluation 


6 Deriving V Ui „e(u, v) is straight forward and depends only upon the image data and the 
gradient of the image data. See Appendix B for details. 

7 The correction for the4 th region in row 3 of Figures 4-3 and 4-9 which corrects a specular 
reflection in the window exceeds the limits described above. 
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Figure 4-9: Aligned and corrected regions for surfel 1. 



Figure 4-10: Aligned and corrected regions for surfel 2. 
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Region 

Match 

Score 

Unique 

Shift 

Slope 

1 ntercept 

x 

y 

r 

9 


r 

9 

5 

1 

X 

-14.8 

29.2 

E&l 

mm 

mm 

mm 

mm 

-29.1 

-28.5 

-47.2 

2 

X 

-15.0 

28.6 

ESI 


mm 

mm 


-25.8 

-28.3 

-38.8 

3 

X 

-24.4 

18.3 

-2.0 


mm 

mm 

■B 

-35.2 

-33.7 

-48.4 

4 

X 

-28.3 

22.1 

-2.0 

ESI 

1.5 

1.5 

1.5 

-39.5 

-33.9 

-46.3 

5 

X 

-0.0 

oo 



1.0 

1.0 

1.0 

0.0 

0.0 

0.0 

6 

X 

-6.8 

64.0 



1.0 

1.0 


-5.3 

-0.5 

4.3 

7 

X 

-31.8 

14.9 

-3.3 

-0.7 

mm 

1.4 

■B 

-15.4 

-12.5 

-37.0 

8 

X 

-23.7 

19.8 

-3.2 

-0.5 

■SI 

mm 

■B 

-4.5 

-11.9 

-14.1 

9 

X 

-15.1 

30.0 


ESfifl 

■SI 

mm 

■B 

-17.4 

-19.4 

-38.3 

10 

X 

-25.4 

18.4 


0.1 

■SI 

mm 

mm 

-14.6 

-18.8 

-33.6 

11 

X 

-34.5 

13.9 

-3.9 

■a 

mm 

mm 


-31.2 

-24.7 

-50.3 

12 

X 

-40.6 

12.0 

B£E1 

gum 

mm 

■B 

■B 

-33.1 

-32.9 

-36.7 

13 

X 

-50.7 

9.7 

-5.0 

■a 

mm 

mm 

1.4 

-9.1 

-16.3 

-21.4 

14 


-152.4 

3.8 


-0.8 

mm 

3.9 

3.9 

-194.0 

-153.0 

-222.8 

15 


-132.8 

4.1 

K2fl 

0.2 

mm 


■u 

-245.1 

-156.9 

-226.1 

16 


-329.2 

1.7 

E£l 

-0.0 

■SI 

■SI 

■21 

-569.0 

-384.2 

-491.7 

17 


-1409.6 

0.9 

ESI 

0.2 

mm 

MM 

mm 

232.4 

195.2 

256.5 

18 


-781.6 

1.1 

7.0 

Q| 

2.0 

mm 


-176.6 

-86.9 

-102.0 

19 


-162.4 

3.3 

-7.0 

i» 

3.8 

3.9 

K2i 

-190.0 

-140.3 

-224.3 

20 


-190.0 

3.1 

-7.0 

■19 

mm 

3.6 

■21 

-164.3 

-119.6 

-198.7 

21 


-132.0 

4.0 

-7.0 

■m 


3.9 

\mm 

-216.6 

-137.1 

-218.0 

22 


-128.3 

4.2 

-6.6 

EB 

mm 

■B 

mm 

1.7 

-5.9 

-31.8 

23 


-904.4 

1.1 

E2fl 

ESJ 

mm 


0.7 

-104.0 

-55.6 

-19.9 

24 


-960.6 

1.1 


-1.0 

gilil 


0.5 

-74.4 

-54.5 

-4.2 

25 


-1115.9 

1.0 


mm 

0.7 

0.5 

0.3 

-38.3 

-0.8 

48.2 

26 


-924.5 

1.1 

-1.0 


B2J 


0.7 

-72.9 

-57.7 

-32.2 

27 


-1052.3 

1.0 

-3.8 

-3.0 

mm 


bsisi 

-181.7 

-128.3 

96.1 

28 

X 

-66.4 

6.8 

-0.7 


■1:1 

■21 


-153.1 

-107.5 

-187.8 


Table4.2: Match data for surfel 1. 


Region 

Match 

Score 

Unique 

| Shift 

Slope 

1 ntercept | 


LJlj 

r 

g 

LJLJ 

r 

g 

5 

1 


-177.2 

0.9 

mwtm 


1.0 

1.0 

■iM 

55.5 

49.2 

73.2 

2 

X 

-0.0 

OO 



1.0 

1.0 

110^ 

0.0 

0.0 

0.0 

3 

X 

-21.3 

2.7 

-0.1 

■SI 

1.0 

mm 

■a 

15.0 

-13.5 

-24.6 

4 

X 

-17.4 

3.2 

-0.1 

0.7 

mm 

kh 

mm 

-3.5 

-10.9 

-14.6 

5 


-116.3 

1.0 

hsisi 

-0.1 

mm 

■SI 

mm 

-37.2 

-47.1 

23.5 

6 


-249.6 

0.8 

-7.1 

HI 



■IBB 

53.1 

31.2 

90.1 

7 


-232.5 

0.9 

-7.0 

wsm 


1 0.9 

0.1 

-86.0 

21.4 

125.8 

8 


-233.2 

0.9 

■2*1 

Biji 


BsSil 

-1.0 

213.9 

162.7 

233.4 

9 

X 

-31.7 

1.9 

1.5 

■B 

mm 

1.7 

■a 

-24.7 

-32.9 

-19.3 

10 


-116.2 

1.0 

-0.2 



1.7 

1.5 

-34.8 

-16.0 

-3.3 

11 


-75.3 

1.3 

-0.3 


mm 

■B 

1.8 

-37.2 

-26.2 

-18.4 

12 

X 

-41.2 

1.8 

®ss# 

LA°J 


1.7 

■a 

-40.0 

-19.4 

-16.0 

13 

X 

-44.9 

1.8 

mm 

mm 


1.7 

BB 

-14.3 

-17.4 

-13.1 

14 

X 

-59.9 

1.5 

mm 

mm 


sna 

mm 

-34.5 

-19.5 

-10.8 

15 

X 

-46.5 

1.8 

mm 

0.2 

mm 

■a 

mm 

-5.6 

-15.6 

-5.2 

16 

X 

-44.1 

1.8 

_015j 

0.2 

mm 

BB 

■a 

-7.1 

-21.1 

-17.6 

17 


-79.8 

1.2 

££fl 

|j|j| 

mm 

mm 

mm 

-56.7 

-0.8 

-100.3 

18 


-69.1 

1.2 

1 - 1 - 8 


1 4.5 

■SI 

1 4 - 4 

-50.3 

-21.7 

-65.0 


Table4.3: Match data for surfel 2. 
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of X(r(£2$),r(v$)) / 1 - To help determine the significance of the 

X S '/' 6 Sj ! I S , GSj 

match between s* and s* we use a measure called uniqueness which is related 
tothesharpness of the correlation function peak. A match's uniqueness is the 
rati oof the best match to the average match of the eight nearest neighbors and 
is defined to be 


where 






u.v 


uniqueness = 


ISj&j 


J 


Ei 

u,v 


E / E 1 


ISjGSj 


ISjGSj 


(4.4) 


(«, u) e {(-1, -1), (0, -1), (1, -1), (-1,0), (1,0), (-1,1), (0,1), (1,1)}. 


A region is considered a match if theshift and color correction arewithin limits, 
at least half of the points in s* contribute to the match and the match score is 
good enough and unique enough. For the results presented in this thesis we 
require the match score to be> -100 and the uniqueness 8 to be> 2 . Tables 4.2 
and 4.3 show the match data for the regions in Figures 4-9 and 4-10. 

Finally, we evaluate the match set. We retain a surfel if the match set 
contains a minimum number of regions, the regions come from a minimum 
number of nodes, z/(S j) is greater than or equal to a minimum value and s* has 
an interest which is greater than or equal toa minimum value. For the results 
presented in this thesis we require the number of regions to be > 6 , the number 
of nodes to be > 5, z/(S j) > -100, and interest > 50. The sets of regions shown in 
Figures 4-3 and 4-4 both produce valid matches. 

4.2.1 Results 

To test the detection characteristics of our algorithm, we evaluated i/(Sj) for 
many surfels near an actual surface. A 100 x 100 x 200 unit volume, theshaded 
area in Figure 4-11, was selected so that an actual surface divides the surface 
into two cubes. 396 positions chosen at 20 units intervals were used as test 
points. Each test point was evaluated every 5° for azimuths ±45° and every 
5° for elevations ±45°. A total of 142,956 surfels were evaluated. Figure 4- 
12 shows the fraction of the nearly 13,000 surfels tested at each displacement 
that produced a valid match set. The left side of Figure 4-13 shows the frac¬ 
tion of detections versus displacement and the angle between the actual and 


8 While growing surfaces (discussed in the next chapter) we allow uniqueness as low as 1.5. 
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tested orientation. The right side shows the fraction of detections versus az¬ 
imuth and elevation 9 . Wealso investigated theeffect of surfel size on detection 
rate. As a control, we measured the detections in the empty volume shown 
as a shaded region in Figure 4-14. An actual surface was selected and tessel¬ 
lated with surfels. The fraction of false negatives (1- the detection rate) for 
an unoccluded surface versus the fraction of true negatives (1- the detection 
rate) for the empty volume is shown in Figure 4-15. Surfel sizes from 5 units 
(lower left corner) to 50 (upper right corner) in increments of 5 units are plotted 
as diamonds. The "bump” in Figure 4-15 is the result of noise and the discrete 
nature of surfels. Similar curves for an occluded (significant tree coverage) sur¬ 
face are shown in Figure 4-16. The left and right plots in Figure 4-16 are for 
the same surface. The data shown in the right plot utilizes the mask function 
described in Section 3.3. A surfel size near 10 units gives the best all around 
performance. The best surfel size is a function of the data and is related to 
the size in world coordinates of the significant image features - in our case the 
windows. Potentially scale-space techniques such as [Li and Chen, 1995] can 
be used to determine the best surfel size. 


4.3 Localizing Position and Orientation 

As shown in the previous section, our method is capable of detecting surfaces 
a significant distance (both in position and orientation) from S,-. The algorithm 
described so far produces a set of corresponding textures, %. The underlying 
surface which gave rise to the textures may have a position and orientation dif¬ 
ferent from that of Sj. The information contained ins, can be used to estimate 
the position and orientation of the underlying surface. 

Once a surface has been detected, we localize its position and orientation 
using the foil owing algorithm: 

1. Until convergence: 

(a) Update surfel position P,-. 

(b) Update surfel orientation n,-. 

(c) Reevaluate//(Sj). 

Steps la, lb, and lc are actually interdependent and ideally should be calcu¬ 
lated simultaneously. For ease of implementation and to speed convergence 10 

9 AII surfels tested at a particular azimuth and elevation regardless of displacement are 
included in the fraction. 

^Theoretically, convergence is 0(n 3 ) where n is the number of parameters optimized, thus 
two half-sized problems converge four times faster than the full-sized one. We have observed 
this speed up in practice. I n addition, the symbolic gradient expression needed by conjugate 
gradient methods is much simpler for the two half-sized problems. 
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we have separated them. As will be seen in this section, we obtain good results 
performing the steps individually. 

We use a slightly modified version of Equation 3.4, which considers only 
matching regions, to update P,,: 


argmin E Cipj x ((P, - C.) x C,p5) ■ (P, - C(). (4.5) 

P i • o _ • 


When P j is updated, u % and v t must also be updated, s* changes with the new 
P j but p* does not, thus {ii u fy) must be corrected for the difference between the 
old and new s*. 

A match's dependence upon S j and can be made more explicit byrewriting 
Equation 3.15 as follows: 


v(Sj) 


ieQ v 7 


max e(n.) 

u,v v J 


E&P • u, 

i&Q 


(4.6) 


where 

£(n,) = - - . (4.7) 

%Sjes j 

To update we evaluate: 

argmax E'(“j)- ( 4 - 8 > 

3 i&Q 

Both direction set methods and conjugate gradient methods 11 work well to 
solve Equation 4.8. 

Once the position and orientation of thesurfel have been updated, wereper- 
form Steps 2, 3, 5, and 6 of the detection algorithm. Step 2 is only partially per¬ 
formed. Images with currently matching regions are retained. Images which 
no longer view the front side of the surfel are eliminated and additional ones 
that do are added. Steps la, lb, and lc of the localization algorithm are re¬ 
peated whilethe match score is improving until the position update is < 1 unit 
and the orientation update is < 1° or for 3 iterations if the match score is not 
improving. 

n While not as straight forward as in the last section, deriving V nj e(ip) leads to an easily 
evaluated expression that depends only upon known quantities such as the image data, the 
gradient of the image data, and the position and orientation of surfel. See Appendix B for 
details. 
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Figure4-17: Surfel localization. 
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Figure4-18: Localization examples. 
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Figure4-19: Localization summary. 
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4.3.1 Results 

Figure4-17 depicts the localization process. I mages which viewthefront side of 
thesurfel are shown in grey. Those that contribute to the match set are shown 
in black. Lines of sight through the center of the matched regions and building 
outlines are also shown in grey. Figure 4-17a shows the initial detection for 
a surfel displaced from the actual surface by 100 units in positions and 3CP 
in orientation. Figure 4-17b shows the surfel after updating its position and 
orientation (Steps la and lb of the localization algorithm). Figure 4-17c shows 
a new match set using the updated position and orientation (Step lc). Figure4- 
17d shows thefinal match set at convergence. The same set of test points used 
to test detection (Figures 4-12 and 4-13) was also used toevaluate localization. 
Figure 4-18 shows two test points evaluated at 361 different orientations. The 
test point for the plot on the left is 80 units in front of the actual surface and 
theonefor the right is 100 units behind. The asymmetry in both of these plots 
is caused by the asymmetric distribution of cameras shown in Figure 4-24. A 
diamond is plotted at the initial orientation for each detection and a pi us marks 
the final estimated orientation. Figure 4-19 shows the aggregate results for 
the complete set of test points. The plot on the left shows final displacement 
versus the initial displacement and the angle between the actual and tested 
orientation. The plot on the right shows the angle between the actual and final 
estimated orientation versus the initial displacement and the angle between 
theactual and tested orientation. For displacements upto 100 units in position 
and 30 ° in orientation, nearly all of the test points converged to the correct 
values. 


4.4 Discussion 

So far we have focused on detecting and localizing nearby surfels and have not 
addressed bogus matches. As pointed in Chapter 3, compensating for noisy 
data admits false positives. Figure 4-20 shows the raw surfels 12 detected and 
localized near one of the buildings imaged in the dataset. There are a signif¬ 
icant number of false positives. Notice the parallel layers of surfels near the 
upper right corner of the building. These are false positives similar to the one 
shown in Figure 3-3. Many of the surfels are near the surfaces of the building. 
Figure 4-21 shows the distribution of distances to the nearest model surface. 
Figures 4-22 and 4-23 show the raw results for thefull reconstruction and Fig¬ 
ure 4-24 shows the volume used for the full reconstruction. The next chapter 
addresses ways to eliminate false positives. 

The reconstructions shown in this thesis were performed on quarter reso- 


12 Only front facing surfels are rendered. 
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Figure 4-22: Raw surfels (fu11 reconstruction). 
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Figure4-23: Distance to nearest model surface (full reconstruction). 
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Figure 4-24: Reconstruction volume (shaded area) used for full reconstruction. 
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Figure 4-25: Raw surfels for full reconstruction using half (top) and eighth 
(bottom) resolution images. 
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lution images. To test the effects of image resolution, we reran the raw detec¬ 
tion and localization algorithms on half and eighth resolution images. Several 
parameters (maximum shifts, interest, and uniqueness) are resolution depen¬ 
dent and these were adjusted appropriately. The results are shown in Figure 
4-25. The half resolution reconstruction is very sparse. We suspect the primary 
cause is image noise. It turns out that half the output resolution actually corre¬ 
sponds to the full resolution of the physical sensor [DeCouto, 1998]. Increased 
image noise degrades the match score, whose tolerance was not adjusted, and 
interferes with the gradient estimates used to optimize the shifts. Assuming 
the noise is Gaussian in nature, reducing the resolution also reduces the noise. 
However, there is a limit to how much the resolution can be reduced; sufficient 
information must be retained to perform the reconstruction. This implies that 
there is a range of image resolutions (with favorable noise characteristics and 
sufficient information content) which are essentially equivalent. The quarter 
and eighth resolution reconstructions seem to support this view. 



Chapter 5 

From Surfels to Surfaces 


Theshifts, illumination corrections, and masks introduced in Chapter 3tocom- 
pensate for noisy data also increase the occurrence of false positives. The re¬ 
sults presented in Chapter 4 are purely local and make no attempt to reject 
thesefalse positives. This chapter explores several geometric constraints which 
together eliminate nearly all false positives. 


5.1 Camera Updates 

As pointed out in Section 3.1 unconstrained shifts introduce false positives such 
as the one shown in Figure 3-3. Figures 5-1 and 5-2 shows shifts {(u,i>)} plot¬ 
ted as a vector field in image-space along with the corresponding image. The 
position of each vector corresponds to the location of the center of a matching 
region. Note that the shifts appear to be random. The actual image displace¬ 
ments caused by camera calibration error should be consistent with a transla¬ 
tion of the camera center and a rotation about it or formally 

(«,»)($ =r(p J ,i‘)-r(p,,f) (5.1) 


where 

v) = r{p v Y) 

and Pj is the location of S^. To enforce this constraint we use the foil owing 

algorithm: 

1. For each camera ?. 

(a) Build 

(b) Calculate a first-order camera calibration update,?'. 

(c) Calculatethefinal camera calibration update,?". 

(d) Remove matching regions with shifts that are not consistent with the 
final update, ?". 
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Figure 5-1: Shifts plotted as a vector field and image data for node 27 

image 13. 
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Figure 5-2: Shifts {{u^Vi)} plotted as a vector field and image data for for node 
28 image 17. 
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Finally, for 7, rotation about the optical axis, 

Ax = -(y-y 0 )S'y (5.6) 

Ay = (x-x 0 )8 j. (5.7) 

Equations 5.2-5.7 combine to form the following pair of coupled linear equa¬ 
tions: 

. = P±SEZl^L Se + T„ + (y- y„)h (5.8) 

j CP -z 

V = f2 + (y ~ yo)2 ^ + —Tp -{x- X o)S 7 . (5.9) 

■I CP-z 

Equations 5.8 and 5.9 aresolved in a least squares fashion [Watkins, 1991]. 
P is updated using 86, 84>, 8 7 , T s , and T/ to produce the initial solution in Step 
lb, P'. Thefinal update is 

argmin £ ||(u,i>) - T^P") + T(P J -,P)I| 2 ) (5-10) 

jiff 

( ui,vi)eQ 

where 

Q= |£>||(«,i)-r(p j ,i i ') + r(p„I‘)|| 2 }. 

Typically e is 1 pixel. Solutions which have enough data (at least 4 points) and 
fit well are retained. P" is used to prune inconsistent matches froms,, 

s} \s, if e < ||(«,t>) -T(P,-,P") +r(P,-,f)H 2 - 

After removing inconsistent matches, if % no longer meets the criteria de¬ 
scribed in Section 4.2, it is discarded. Figure 5-6 shows the average rms shift 
before and after performing the calibration update. The number of consistent 
data points (shifts) are plotted along the a; axis and the square root of the mean 
squared shift is plot along the y axis. Note that the necessary shifts are sig¬ 
nificantly reduces. Ideally, after updating the camera calibration estimates, no 
shifts would be required. The images used as input contain a small amount of 
nonlinear distortion which cannot be modeled by the pin-hole camera model. 
We estimate the rms value of this distortion to be about 1 pixel. 

Figure 5-4 shows the consistent surfels remaining after applying this algo¬ 
rithm tothe raw reconstruction shown in Figure 4-22 and Figure 5-5 shows the 
distribution of distances tothe nearest model surface. A number of theconsis- 
tent surfels come from objects which are not in the reference model. The cluster 
of surfels between the building outlines near the top center of Figure 5-4 is one 
example. These surfels come from a nearby building. As a test, a small portion 
of the full reconstruction was rerun with the updated camera calibration. The 
results are shown in Figure 5-7. For comparison, the consistent surfels from 
the same area are shown in Figure 5-8. 

^he internal parameters are held fixed and for stability reasons T? is not updated. 
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Figure 5-4: Consistent surf els. 
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Figure 5-5: Distribution of error for consistent surfels. 



Figure 5-6: Comparison of initial and final calibration estimates. 
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Figure 5-7: Close-up of reconstruction using updated camera calibrations. 



Figure 5-8: Close-up of same area showing only consistent surfels from original 
reconstruction (estimated camera calibrations). 
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5.2 One Pixel One Surfel 



Figure 5-9: A region from one image Figure 5-10: Determining if S b is a 
which contributes to multiplesurfels. neighbor of S a . 

Figure 5-9 shows a region from one image which contributes to the match 
set of multiple surfels. Each pixel in each image should contribute to at most 
one surfel. Deciding which surfel is the hard part. Detection and localization 
as described in Chapter 4 do not enforce this constraint and as a result even 
after enforcing a consistent calibration update there are many image regions 
which contribute to multi pie surfels. We eliminate them in a soft manner using 
thefollowing algorithm: 

1. Score each surfel. 

2. For each surfel S a . 

(a) For each region e s a . 

i. For each surfel S b with a score higher than S a , if s| e s b . 

A. De-weight s* a . 

(b) If the match score is no longer sufficient, prune S a . 

To score a surfel we use a combination of the number of cameras which 
contribute to a surfel and the number of neighbors that it has. A surfel S a is 
considered a neighbor of S b if 1) the distance from P b to the plane containing 
S a (the normal distance) is no more than 4; 2) the distance from the projection 
of S b onto the plane containing S a and P a (the tangential distance) is no more 
than d t ; and, 3) the angle between n a and n h is no more than (3. This notion of 
neighbors is essentially a smoothness constraint and will be used in the next 
section to group surfels. Typically, we use 4 = 15, d t = 300, and f3 = arccos(0.9). 
The de-weighting is done in a continuous manner. S a is divided into sample 
points (we use 10 x 10) each with an initial weight of 1.0. Each sample point 
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Figure 5-11: Surfels after pruning multi pie contributions. 
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Figure 5-12: Distribution of error for pruned surfels. 

is tested against the cone formed by S fc and Q. Those inside the cone have 
their weights updated. The new weight is based on the normal distance and 
the angle between n a and n 6 . If the angle is greater then /3, the new weight 
is 0.0. Otherwise, the new weight is based on the normal distance; 1.0 if it is 
less than d zl 0.0 if it is greater than 3<T, and varies linearly in between. The 
lesser of the currently assigned and new weight is retained. After all $/s have 
been tested against S a , the average sample point weight is used as the new 
weight for s l a . When each s* has been tested the sum of the weights is used to 
determineif S a is retained. Figure5-ll shows the reconstruction after pruning 
multiple contributions and Figure 5-12 shows the distribution of distances to 
the nearest model surface. 


5.3 Grouping Surfels 

The buildings wearetryingto model aremuch larger than an individual surfel. 
Therefore, a large number of surfels should be reconstructed for each actual 
surface. Using the notion of neighbors described in the last section, we group 
the reconstructed surfels as follows: 


1. For each surfel S a . 
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(a) For each surfel S b already assigned a group. 

i. If S a and S fc are neighbors. 

A. If S a has not already been assigned to a group, then assign Sb 
to the group containing S b . 

B. Otherwise merge the groups containing S a and S b . 

(b) If S a has not been assigned to a group, then create a new group and 
assign S a to it. 

In practice we retain only groups which have at least a minimum number 
of (typically four) surfels. All of the surfels in a group should come from the 
same surface. This notion of grouping places no restriction on the underlying 
surface other than smoothness (e.g. it may contain compound curves). Figure 
5-13 shows the reconstruction after grouping and removing groups with fewer 
than four surfels. Nearly all of the surfaces in the reference model have at 
least one corresponding group. Figure 5-14 shows the distribution of distances 
to the nearest model surface. 


5.4 Growing Surfaces 

Many of the groups shown in figure5-13 do not completely cover theunderlying 
surface. There are several reasons why surfels corresponding to actual surfaces 
might not produce a valid match set. The main one is soft occlusion described 
in Section 3.3. Another is local maxima encountered while finding the best 
shifts and updating the surfel's normal. I n the reconstruction process so far, 
the mask technique described in Section 3.3 has not been utilized. In this 
section we make use of it. We also use estimated the shifts and illumination 
corrections to help place us closer to the correct maxima. We use the foil owing 
algorithm to grow surfaces: 

1. For each group. 

(a) Create an empty list of hypothesized surfels. 

(b) For each hypothesized surfel. 

i. Test using the detection and localization algorithms. 

ii. If a match. 

A. Add to the group. 

B. Test against each surfel in each of the other groups. 

If a neighbor, merge the two groups. 

(c) Use the next surfel in the group S a to generate new hypotheses 
and goto Step lb. 
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Figure 5-13: Surfels after grouping. 
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Figure 5-14: Distribution of error for grouped surfels. 


The hypotheses in Step lc are generated from S a by considering the eight 
nearest neighbors in the plane containing S a as shown in Figure 5-15. S a is 
shown in grey and the hypotheses are shown in white. Each of the other recon¬ 
structed surfels in the group are orthographical ly projected onto the hypothe¬ 
sized surfels and hypotheses which are more than half covered are discarded. 
The shifts and illumination corrections associated with S a are used as initial 
values for each hypothesis in Step l(b)i. In addition we lower the minimum 
interest to 25 and the minimum uniqueness to 1.2. Figure 5-16 shows the 
reconstruction after growing. After growing, the coverage of each surface is 
nearly complete. Figure 5-17 shows the distribution of distances to the nearest 
model surface. Figure 5-18 shows grown surfels after removing multiple con¬ 
tributions and grouping as described in the previous two sections and Figure 
5-19 shows the distribution of distances to the nearest model surface. 


5.5 Extracting Models and Textures 

So far, the only assumption we have made about the structure of the world is 
that locally it can be approximated by a plane. All of the buildings imaged in 
our dataset are composed of planar faces, therefore we simply fit planes to the 
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Figure 5-15: Reconstructed surfel (grey) and hypotheses (white). 

groups identified in the previous section. In this case, a face is equivalent to a 
large surfel. The position and orientation of S 9 are determined by 

F 9= J2 wP i / J2 w (5.U) 

SjG.Sg / SjG.Sg 

and 

n g = E wi t / E w ? ( 5 - 12 ) 

where w is the score calculated in Step 1 of the algorithm to remove multiple 
contributions and S g is the set of all surfels in group g. The extent of S g is 
determined by orthographical I y projecting S g ontoS 9 and finding the bounding 
box. Figure 5-20 shows the reconstructed S/s. A total of 15 surfaces were 
recovered. Figure 5-21 shows the distribution of distances to the nearest model 
surface. As noted previously, many surfels come from structures not in the 
reference model. Three of the reconstructed surfaces fall into this category, 
hence Figure 5-21 has a maximum of 80. 

I mage data from contributors tos a can easily be related using the illumina¬ 
tion corrections calculated during detection and localization. The relationship 
between s * 6 and s a where s a and s b are members of the same group is more 
difficult when i is not a contributer of s„. To resolve these relationships we 
construct a table of the illumination corrections for J, the set of images con¬ 
tributing toat least one surfel in S, using the foil owing algorithm: 

1 . For each group. 

(a) Find the surfel with the highest score, S m . 

i. Makes^ the root of the tree. 

ii. Add the illumination correction for to the table where? ^ *. 
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Figure 5-16: Surfels after growing. 
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Figure 5-17: Distribution of error for grown surfels. 


(b) Until no new entries to the table are added, 
i. For each Sj. 

A. If 3 a, b | s? is in the table and s’’ is not, 

Then calculate the illumination correction between ^ and s* n 
and enter it in the table. 

We assume that each surfel in a group has at least one contributing image in 
common with at least one other surfel in the group. When building the table 
we consider only the first correction encountered for each contributing image. 
Oncethetable is built we correct each image in 1 and reproject it ontoS 9 . The 
texture associated with S g is simply the average. Figure 5-22 shows two views 
of the reconstructed textures. Notice that the rows of window in adjacent faces 
are properly aligned. This occurs even though no constraints between faces are 
imposed. 


5.6 Discussion 

This chapter uses several simple geometric constraints to remove virtually all 
false positives from the purely local reconstruction described in Chapter 4. Af- 
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Figure 5-18: Surfels after regrouping. 









5.6. DISCUSSION 


103 



Figure 5-19: Distribution of error for regrouped surfels. 


ter imposing consistent calibration updates, removing multiple contributions 
and grouping, the remaining surfels are excellent seeds for growing surfaces. 
Of the 16 surfaces in the reference model, 12 were recovered. All of the re¬ 
maining surfaces are severely occluded by trees. Nearly all of the images are 
similar to the upper left-hand image of Figures 4-4 and 4-6. In spite of this 
several surfels were recovered on two of the surfaces, however they did not 
survive the grouping process. Figure 5-23 shows the results of growing these 
surfels. The top shows the raw surfels after growing and the bottom shows re¬ 
constructed textures. I n addition to being severly occluded by trees, the other 
two surfaces have very little texture and one of them suffers from a lack of 
data. Three surfaces from adjacent buildings not contained in the model were 
also recovered. The face near the top center of the upper image in Figure 5-22 
is from the Parson's lab. The surfaces on the left of the upper and the right of 
the lower image is from Draper lab. 

A summary of the computation required is presented in Table 5.1. The ma¬ 
jor steps of our algorithm and thesection which describes them are listed along 
with the elapsed and cpu runtimes, number of surfels output, and complexity. 
A total of 64.5 hours of cpu time was needed to produce the textured model 
shown in Figure 5-22. This is less than 1 minute of cpu time per image and 
less than 0.2 seconds of cpu time per test point. The "Detection & Localiza- 
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Figure 5-20: Raw model surfaces. 
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Figure 5-21: Distribution of error for model surfaces. 


Step 

Section 

Time 

Surfels 

Complexity 

Elapsed 

cpu 

Detection & 
Localization 

4.2 

4.3 

42hrst 

60hrs 

54,212 

0(V xNxS) 

Camera 

U pdates 

5.1 

18.5min 

17min 

2977 

0(N x #) 

1 Pixel, 

1 Surf el 

5.2 

3.5min 

2min 

1636 

0(N x # 2 ) 

Group 

5.3 

4min 

2.5min 

1272 

om 

Grow 

5.4 

6hrst 

4hrs 

3007 

0(A x N ) 

Model 

5.5 

11.25min 

9.5min 

15 

om 

Texture 

5.5 

24.5min 

15.5min 

15 

0(A x N ) 


Table 5.1: Run times and orders of growth. 
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Figure 5-22: Textured model surfaces. 
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Figure 5-23: Textured model surfaces with two additional surfaces. 
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tion" and "Grow" steps were performed in parallel on a cluster of 32 400Mhz 
dual processor Pentium 11 machines running linux. The remaining steps were 
performed on a single 200Mhz uniprocessor Pentium Pro machine also run¬ 
ning linux. The two times marked with daggers reflect the speed up of parallel 
processing. All others (including cpu time for "Detection & Localization" and 
"Grow") are total times. The complexities shown are upper bounds. V is the 
reconstruction volume, N is the total number of images, S is the surfel size 
(area in world coordinates), # is the total number of recovered surfels, and A 
is the total surface area (in world coordinates) of the scene to be reconstructed. 


Detection & Localization 

As noted above, the reported elapsed time reflects parallel execution. The total 
elapsed time is 1,260 hrs (42 hrs x 30 nodes). The cpu time required is 1/20 
of the total elapsed time. The major cause of this is I/O. The reconstruction is 
divided into 6000 chunks. Each of these chunks is reconstructed independently 
and requires loading a large image set from a fileserver across a network. The 
work required to test a single surfel is linearly proportional to surfel area in 
world coordinates ( S) and the number of images which view the surfel. The 
total number of surfels tested is linearly proportional to the reconstruction vol¬ 
ume. N is an upper bound on the number of images which view a surfel, but 
it is not a tight bound. For example, only a fraction of the dataset can image 
a particular surfel. Therefore, if the dataset is acquired with roughly a fixed 
node density and images only contribute to surfels which are not too far away 
(Section 4.2), we would expect the number of images which view a surfel to 
eventually become independent of N. In essence, once the dataset has reached 
a certain size, new images extend the reconstruction volume but do not affect 
the local reconstructions at the core of the volume. For large data sets the 
expected order of growth is 0(V x S) 


Camera Updates 

The work required to update a single camera is proportional to the number of 
surfels it contributes to squared and a total of N cameras must be updated. # 
is an upper bound on the number of surfels which a camera can view, but it is 
not a tight one. Similar to the discussion in "Detection & Localization", once 
the reconstruction reaches a certain size, the number of surfels imaged by a 
given camera should remain roughly constant. Spatial hashing can be used to 
help exploit this property. Thus, for large reconstructions the expected order of 
growth is O(N) 
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1 Pixel, 1 Surfel 

The work requi red to i mpose the "1 P i xel, 1 Su rfel" constrai nt for a si ngl e i mage 
contributing to a given surfel is linearly proportional tothenumber of surfels in 
a cone formed by the camera's center of projection and the surfel and extending 
out to the maximum allowed distance from the camera. This constraint must 
be imposed for each image contributing to each of the# surfels. If images can 
contribute only to surfels which are not too close (Section 4.2), then once the 
reconstruction reaches a certain size, the number of surfels in the cone should 
remain roughly constant. Therefore, for large reconstructions with large data 
sets the expected order of growth is 0(#). 

Group 

The work required to group a surfel is linearly proportional to the number of 
surfels in the vicinity and all # surfels must be grouped. The number of surfels 
in a given area is bounded and by using spatial hashing the expected order of 
growth becomes 0(#). 

Grow 

The comments in "Detection & Localization” about elapsed time apply here, 
except that only 15 surfaces were grown, therefore only 15 nodes were used. 
Assuming that the growing algorithm effectively stops at surface boundaries, 
there are a total A/S locations on all faces to be tested. For the faces shown in 
Figure 5-22 this is a good assumption. For large data sets, each location tested 
requires O(S) work, giving an expected order of growth of O(A). 

Model 

The work required to fit a (plane) model is linearly proportional tothenumber 
of surfels in the group. Since every surfel belongs to a group, the given order of 
growth 0(#) is a tight bound. 

Texture 

The work required to extract the texture associated with a face is linearly pro¬ 
portional to the area of the face and the number of images which view it. N 
is an upper bound on the number of images which can view a face and since a 
face can be of arbitrary size it is not clear that we can do better. Therefore, the 
order of growth is 0(A x N). 
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C hapter 6 
Conclusions 


This thesis presents a novel method for automatically recovering densesurfels 
using large sets (1000's) of calibrated images taken from arbitrary positions 
within the scene. Physical instruments, such as Global Positioning System 
(GPS), inertial sensors, and inclinometers, are used to estimate the position 
and orientation of each image. Long baseline images improve the accuracy; 
short baselines and the large number of images simplify the correspondence 
problem. The initial stage of the algorithm is completely local enabling par¬ 
allelization and scales linearly with the number of images. Subsequent stages 
areglobal in nature, exploit geometric constraints, and scalequadratically with 
the complexity of the underlying scene. 

We descri be techniques for: 

• Detecting and localizing surfels. 

• Refining camera calibration estimates and rejecting false positive surfels. 

• Grouping surfels into surfaces. 

• Growing surfaces along a two-dimensional manifold. 

• Producing high quality, textured three-dimensional models from surfaces. 
Some of our approach's most important characteristics are: 

• It is fully automatic. 

• It uses and refines noisy calibration estimates. 

• It compensates for large variations in illumination. 

• It matches image data directly in three-dimensional space. 

• It tolerates significant soft occlusion (e.g. tree branches). 

• It associates, at a fundamental level, an estimated normal (eliminating 
thefrontal-planar assumption) and texture with each surfel. 


Ill 
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Our algorithms also exploit several geometric constraints inherent in three- 
dimensional environments and scale well to large sets of images. We believe 
that these characteristics will be important for systems which automatically 
recover large-scale high-quality three-dimensional models. A set of about 4000 
calibrated images was used to test our algorithms. The results presented in 
this thesis demonstrate that they can be used for three-dimensional recon¬ 
struction. To our knowledge, the City Scanning project (eg. [Coorg, 1998] 
and the work presented in this thesis) is the first to produce high-quality tex¬ 
tured models from such large image sets. The image sets used in this thesis 
are nearly two orders of magnitude larger than the largest sets used by other 
approaches. The approach presented in this thesis, recovering dense surfels 
by matching raw image data directly in three-dimensional space, is unique 
amoung the City Scanning approaches. 


6.1 Future Work 

The major limitation of the work presented in this thesis is the difficulty of 
performing reconstruction "through the trees”. A large part of the difficulty 
is caused by matching against s*. If s* is not representative of the underlying 
texture, then the match set s :) will most likely be insufficient resulting in a 
false negative. If we knew the underlying texture and used it as s*, the match 
set should improve significantly. Similarly, estimating the underlying texture, 
rather than simplying selecting one of the views, and using it ass) should also 
improve the match. Potentially this would allow us recover the missing faces 
in Figure5-22. 

Several other areas could also benefit from further exploration: 

• Develop alternate techniques for exploring the volume of interest. 

• Explore improvements/optimizations to the basic algorithms. 

• Develop a more sophisticated model of illumination and reflectance to im¬ 
prove the correction. 

• Explore fitting a broader class of models to surfels. 

• Develop techniques to enhance the recovered texture. 

6.1.1 Exploring the Volume of Interest 

As noted in Section 4.2, we exhaustively test all of the nearly 2 x l(f points 
in the volume of interest. About 2% of the tested points produce surfels and 



6.1. FUTURE WORK 


113 



Figure 6-1: Free space. 

about 6% of those are consistent with a single set of camera calibration up¬ 
dates. One way of reducing the number of points that need to be tested is to 
use techniques such as binocular stereo to generate test points that are likely 
to produce surfels. Because of error in the estimated three-dimensional posi¬ 
tion, a small volume around the point should be tested. If a surfel is detected 
and localized, additional test points can be generate nearby in a manner sim¬ 
ilar to that used in Section 5.4 to grow surfaces. Another possible approach 
is to keep track of empty space. For example, if surfel S in Figure 6-1 corre¬ 
sponds to an actual surface, then the shaded area must be empty space and 
need not be tested. Conservatively estimating regions of empty space during 
the reconstruction process could significantly reduce the number of points ac¬ 
tually tested. Argus, our image acquisition platform is pushed for node to node 
while obtaining a dataset and the path it follows also identifies empty space 
that limit test points. In a similar fashion, it may prove useful to keep track 
of the general location of soft occluders. For example, views which do not have 
soft occlusion could be selected preferentially or weighted more heavily. 

6.1.2 Improving the Basic Algorithm 

Several components of the basic algorithm could be improved. Several param¬ 
eters such as surfel size and thresholds for interest and uniqueness are se¬ 
lected and utilized on a global basis. These values are dependent upon the 
data and their values should reflect this dependence. For example, the overall 
image brightness and contrast have a large impact on interest and uniqueness 
scores. For various reasons (geometry, sun position, etc.), some building faces 
are frequently brightly illuminated and others are nearly always in shadow. 
Using the same threshold for both tends to produce a large number of false 
positives in theformer case and a large number of false negatives in the latter. 
Thresholds based upon the imaging conditions should produce better results. 
The quality of matches could also be improved by performing comparisons at 
the appropriate resolution. The effective resolution of reprojected regions vary 
with viewing condition (e.g. distance and foreshortening). Source images could 
be loaded into a pyramid of resolutions and prior to matching, a level selected 
such that all the reprojected textures are of similar resolution. This technique 
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should significantly improve the algorithms ability to match distant or highly 
forshorted views. As noted in Section 4.3, updating a surfels position and 
orientation should ideally be done simultaneously. Formulations which com¬ 
bine these optimizations may localize surfels more accurately. And finally, as 
noted in Section 3.3, our masking algorithm is conservative. More sophisti¬ 
cated methods of identifying outlier pixels would enable the masks to be more 
widely used and should result in fewer false negatives. 


6.1.3 Illumination and Reflectance Models 

The correction proposed in Section 3.2 does a good job of compensating for 
changes in viewing condition (illumination, view point, reflectance, etc.), but 
theextra degrees of freedom also admit false positives such as shown in Figure 
3-5. Oneway to reduce the number of false positives is to constrain the correc¬ 
tions by measuring the viewing conditions. For example, the total irradiance 
arriving at a node and an estimate of cloud cover might be sufficient for a rough 
estimate of the correction. The corrections used to extract the textures shown 
in Figure 5-22 are consistent within a face, but not necessarily between faces. 
A direct estimate might make it possible to locate all corrections in a global 
correction space. 

As demonstrated in Figures 4-3 and 4-9 in some cases our method is ca¬ 
pable of compensating for specular reflection. These corrections are generally 
rejected in an attempt to limit false positives. One way to improve match¬ 
ing for surfaces with substantial specular characteristics is to decompose the 
reflectance into specular and diffuse components and use just the diffuse com¬ 
ponent for matching. Nayar et al. [Nayar eta/., 1997] show that the specular 
component of an objects reflectance tends to be linearly polarized while the 
diffuse component is not. Using this property and a polarization image 1 , Na¬ 
yar et al. present an algorithm for separating the components. Wolff [Wolff 
and Andreou, 1995, Wolff, 1997] demonstrates a practical device for acquiring 
polarization images. 

Finally, given a number of images of the same world surface taken under dif¬ 
ferent viewing conditions, it should be possible to estimate its bidirectional re¬ 
flectance distribution function (BRDF) [Florn, 1986]. Tagareand DeFigueredo 
[Tagareand deFigueiredo, 1993], Oren and Nayar [Oren and Nayar, 1995], and 
Dana etal. [Dana etal., 1996] present a techniques for fitting BRDF's to image 
data. 


1 A set of col or i mages of the same scene acquired from the same location each imaged with 
a polarization filter at a different orientation 
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6.1.4 More Descriptive Models 

The models extracted in Section 5.5 are composed of simple rectangles. The 
buildings in our dataset arewell modeled by planar faces, however many build¬ 
ings have components which are not. Fitting general models to data is an open 
problem, however expanding the set of primitives to include cones, cylinders, 
and spheres as well as planes would greatly increase the descriptive power of 
our models. Further, the faces recovered should be part of a closed solid. This 
constraint could easily be enforced by hypothesizing missing faces. Imposing 
this constraint would result in recovering two of the four unrecovered faces. As 
part of the City Scanning project, Cutler [1999] has investigated aggregating 
reconstructions from multiple sources and imposing solid constraints. 

6.1.5 Better Textures 

As described in Section 5.5 the textures displayed in Figure5-22 aregenerated 
by averaging the illumination corrected image data. The results are good but 
can be improved. Iteratively estimating the average texture should improve 
the results. For example, thecurrent average texture is used to reestimatethe 
illumination corrections for each image and weights for each pixel of each im¬ 
age. The new illumination corrections and weights are then used to reestimate 
the average texture until it converges. No attempt has been made to align the 
data from individual images. Warping techniques similar tothose described by 
Szeliski [Szeliski, 1996, Szeliski and Shum, 1997] should improve the quality 
of the extracted texture. 
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Appendix A 
Camera Model 




FigureA-1: Camera model. 

The pin-hole camera model util ized throughout this thesis uses the perspec¬ 
tive projection model of image formation. A 4x3 matrix maps points expressed 
in homogeneous coordinates from V 3 toV 2 . The left side of FigureA-1 shows 
a camera centered Cartesian coordinate system. The camera's optical axis is 
coincident with the 2 axis and its center of projection (C) is at the origin. The 
image plane is parallel to the a;?/ plane and located a distance / from the origin. 
Even though the image plane is not required to be par all el to the xy plane, most 
camera models do not explicitly consider this possibility. I nstead the effects of 
image plane pitch 9 X and tilt 9 y are lumped in with lens distortion under the 
heading of thin prism distortion. We assume that 9 X = 9 y = 0 for consistency 
with traditional pin-hole camera models. The point wherethe image plane and 
the optical axis intersect is known as the principal point. Under perspective 
projection a point P c = [X C ,Y C ,Z C , 1] projects to point p = [x,y, 1] on the image 
plane by the foil owing equations: 
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In practice, we are not able to directly access the image plane coordinates. 

I nstead we have access to an array of pixels in computer memory. I n order to 
understand the relationship between image plane coordinates and the array 
of pixels, we must examine the imaging process. The image plane of a CCD 
camera is a rectangular array of discrete light sensitive elements. The output 
from each of these elements is an analog signal proportional to the amount 
of light incident upon it. The values for each of these elements are read out 
one element at a time row after row until the entire sensor array has been 
read. The analog signal is converted to digital values by internal circuitry. 
Digital cameras use this signal directly. Analog cameras require the use of 
an additional external A to D converter commonly known as a frame grabber. 
In addition, color cameras require the signals from adjacent red, green, and 
blue sensor sites to be combined. The result is an array of digital values which 
can be read into the memory of a computer. These processes introduce noise 
and ambiguity in the the image data (e.g. pixel jitter which extreme cases can 
cause the x and y axes to appear non-orthogonal or skewed [Lenz and Tsai, 
1988]). Most camera models omit skew angle 9 xy (the angle between the x 
and y axes minus 90°). We include 9 xy but assume it is 0. A single element of 
the array in memory is commonly called a pixel. We will refer to the row and 
column number of a given pixel as ?/ and x' respectively. Several parameters 
are defined to quantify the relationship between the array in memory and the 
coordinate system of the image plane. x 0 and y 0 are the pixel coordinates of the 
principal point. s x and s y are the number of pixels in memory per unit distance 
in there and y direction of the image plane. These parameters along with / and 
9 xy are intrinsic or internal camera calibration parameters. The projection of 
point P c topointp' = [x',y',l] in memory is described by the foil owing equations 

x = fs x -b x 0 

/ r Uc 

y = JSy— + y 0 , 

Z c 

or more compactly 

p = PC (A.l) 

s x 0 0 

= P tan^xy s y 0 

x 0 y 0 1/f _ 

C isa3x3 lower triangular matrix which contains the internal camera param¬ 
eters. 
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The right half of FigureA-1 shows an arbitrary Cartesian coordinate system 
which we will refer to as the world or absolute coordinate system. A point 
P w in the world coordinate system is transformed into the camera centered 
coordinate system by the foil owing equation: 

Pc = PwP + T^ 


or 


Pr 


/I,. 


n o 

T 1 


(A. 2) 


Where U is a 3 x 3 orthonormal rotation matrix and T is a 1 x 3 translation 
vector. This matrix is commonly referred to as the extrinsic or external camera 
parameters. 

The pin-hole camera is a linear model and is not able to model nonlinear 
effects such as lens distortion. The major categories of lens distortion are: 


1. Radial distortion - the path of a light ray traveling from the object to the 
image plane through the lens is not always a straight line. 


2. Decentering distortion-the optical axis of individual lens components are 
not always col I inear. 

3. Thin prism distortion - the optical axis of the lens assembly is not always 
perpendicular to the image plane. 
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CAMERA MODEL 



Appendix B 

Gradient Derivations 


This appendix presents the symbolic gradient expressions used along with con¬ 
jugate gradient methods to find the optimum shift (Section B.l) and surfel nor¬ 
mal (Section B.2). 


B.l Gradient Expression for Updating Shifts 

For simplicity, we derive the results for a single color channel (red). To obtain 
expressi ons for the remai ni ng col or channel s si mpl y substitute the appropri ate 
color channel values. Equation 4.3 can be rewritten using Equations 2.9 and 


3.11 as 

e(u, v) = e T (u , v) + e g (u, v ) + e h (u, v ) 

(B.l) 

where 

( \ (( m T ri + <i r ) — T2) j 1 

e r (u,v)= E / E ^ 

xSjeSj '' / xSjeSj 

(B.2) 


ri = r (Ksj) , 

(B.3) 

and 

^ = r (!$) ■ 

(B.4) 

Using linear regression to calculate m, and d r yields 1 



- E r iE r 2 

x,y x,y x,y x,y 

mv ~ E 1 

x,y x,y x,y x,y 

(B.5) 

and 

E r iE r 2 - E r iE r i f 2 

, x,y x ,y x,y x,y 

E x E r i - E r iE r i 

x,y x,y x,y x,y 

(B.6) 


’■To simplify the presentation x,y is used in place of %Sj e Sj. 
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The gradient of Equation B.l is 


^ r ('U,'u) ~|“ 7 u,v^-g (^5 9“ 1 ?) (B.7) 


<9e r <9e r 

l 

de g <9e g 

l 

de h de h 

du ’ dv 


du ’ dv 

~r 

du ’ dv 


Again for simplicity, we derive only To obtain an expression for ^ simply 
substitute v for m. Differentiating Equations B.2, B.5, and B .6 with respect to u 
gives 


<9e r 

du 


£■ 


+ eL 


2(7? 



(B. 8 ) 


(9 m 
( 9 m 


and 

<96 

<9m 


Ell>? ^ EnEn) (ElEn^r + E^E^ 

\x,y x,y x,y x,y ) \x,y x,y x,y x,y 


y- r _ y^^ r 'i y^ r \ _ / 

7^ 1 ‘ky du TTj du 7% ) ' 

fel E’^-2EnE^ 

\ x,y x,y x,y x,y 


K x,y x,y 


- E r iE r 2 

x,y x,y / 

E x E r ? - E r iE r i 

K x,y x,y x,y x,y / 


(B.9) 


(2&I7& + ErtE^ 

\x,y x,y x,y x,y ) \ x,y x,y x,y x,y 


(B.10) 


Y^ dr 1 ^ \ - tei \ - \ - dr 2 \ (^ 2 y^ \ \ -v 

XtttX^ - X^X-aT^Xr-iXT-i— - £ r i£ r 2 - 

X y Uii X y X y X y Uii rj- y rj> y Uii J \ X ^ X , t/ #,?/ # , ?/ > 

feiEte- 2 T r 'T Sr ‘ 


x,y x,y 


x,y x,y 


du 


E x E r ? - X r iE r i 

x, y x,y x,y / 


u and v are shifts in theT^ and yt directions respectively in image i, therefore 
differentiating Equations B.3 and B.4 with respect to u produces 


and 


dr i dr i dx 

du dx du 

dr i 
dx , 

dr 2 dr 2 dx 

du 


(B.ll) 


dx du 

0. 


(B.12) 



B.2. GRADIENT EXPRESSION FOR UPDATING NORMALS 


123 


Substituting Equations B.9, B.10, B.ll, and B.12 into Equation B.8 results in 
an expression which depends only upon the red channel of the raw image data 
and the gradient of the red channel [n, r 2l and ^). Similar results apply for 
^ as wel I as the green and bl ue channel s. 


B.2 Gradient Expression for Updating Normals 


Again for simplicity, we derive the results for a single color channel (red). To 
obtain expressions for the remaining color channels simply substitute the ap¬ 
propriate color channel values. Similar to Equation B.l, Equation 4.7 can be 
rewritten as 

e(0, (p) = e r ((9, 4>) + e g (9, (p) + e b (<9, (p ) (B.13) 


where 


e T (d,(p) = J2 

i 



{{m T ri + d r ) - r 2 y 


at 



(B.14) 


r 1 =r(T(»S,,r)+ [«,«]), 

and 

r 2 = r(T(«'S i ,I*))- 


(B.15) 

(B.16) 



Figure B-l: Surfel with attached coordinate system. 

Figure B-l shows surfel S j with an attached coordinate system. Points on Sy 
can be specified as an offset along xl and yl relative to 

y x Sj = P j + xwxl + ywyl . (B.17) 

w is a scale parameter which determines the spacing between points on the 
surfel. The normal is parameterized as a small rotation relative to the n,-. 6 
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specifies the axis of rotation 


if = xl cos 0 + yt sin 0 (B.18) 

and 4> specify the angl e of rotati on about if. Accounti ng for the updated normal, 
Equation B.17 becomes 

y x Sj = P j + xwTZ(9, (j>)xt + ywTZ(0,4>)yt (B.19) 


where 


77(M) = 


(B.20) 


g* + (1 - g^jcos^ g x g y (1 - cos (/>) + q 7 sin (f) g x g z (l - cos<f) - q y siiuj) 
g x <Zy(l — cos 4>) — g z sin (f> gj; + (1 - gj;) cos 0 g y g 2 (l - cos0) + g x sin9 
q x q z (l - cos (/)) + q y sin 4> g y g z (l - cos<f) - q x sin(j) g? + (1 - q z ) cos 4> 

The image point to which y Sj projects is 

n y x s>, I) = W,v'\ 


where 


and 


/ „ xT ■ ( y Sj — Q) 

x = / -=r—TiTci-7TT + ^0 + u, 


Z C : ■ (:!*, - Q 


y' = + m + 


^ • (xSj - Q) 

The gradient of Equation B.13 is 

T7e,<t> e {9,4>) = Vg^e T (9, cp) + Vg^e^O, <f) + <f) 


(B.21) 

(B.22) 


9e r 9e r 

l 

de g de g 

l 

de h de h 

dO 1 d(f> 

~r 

dO ’ d(j) 

~r 

dO ’ dcj) 


(B.23) 


First we will examine^. Differentiating Equation B.14gives 


<9e r 

90 


= E 


((m r ri + dr) - r 2 ) (m r ^ 


I dm T r , i 9d r _ drs 

^ <90 ' 1 f 


<90 <90 


2 ( 7 ? 


E 1 - (B .24) 


#,?/ r / 

Differentiating Equations B.9and B.10 and using the chain rule produces 


dm r dm r dx dm r dy 


d9 


dx 09 dy dO 


(B.25) 


dd r dd r dx dd r dy 
d9 dx d9 dy d6' 


(B.26) 


and 
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The expressions for ^ and ^ are given in the last section. Differentiating 

1 ox oy u 

Eguations B.15 and B.16 and using thechain ruleyields 


dr i dr i dx dr\ dy 
d9 dx dO dy dO 


and 


dr 2 

dO dx dO 

Eguations B.21 and B.22 produce 


dr 2 dx dr 2 dy 

dy dO' 


dx = X. • (gSj - Cj) x^ ■ ^ - X?. • (gSj - CJ z£ • ^ 
^ (^ • (xSj - CJ) 2 


<90 


and 


dy_ = X ■ ( y x Sj (%S j - CJ^ • — 

^ (^ • - CJ) 2 


<90 


Differentiating Eguation B.19 yields 


-i = xw -—— x s + yw -— y s 


dO 


dO 


dO 


where 

dH{0,<j>) _ 

90 “ 

2q x (1 — cos <^>) (<? x -§$- + ^§Q-q y )(X — c°c 4>) + sio 0 (*? x — cos (/>) — 

(Qx -§q~ + ^$r*2y)( 1 — cos <f>) — sin 2q y (1 — cos <f)) {q y -) q z )( 1 — COS <f)) + 

(qx ST + Qz)(l — c°c <A) H sin c/> (q y H (?z)(l — cos </>) — sin (f> 2q z (1 — cos tj>) 


(B.27) 

(B.28) 

(B.29) 

(B.30) 

(B.31) 

(B.32) 


dq y • a. 
sr sm< ^ 

dqy- ■ , 

ST sm< ^ 


and 

c )—^ 

—— = ytcosO - T^sin0. (B.33) 

dO 

Substituting Eguations B.25, B.26, B.27, B.28, B.29, B.30, B.31, B.32, and B.33 
into Eguation B.14 results in an expression for || that is easily evaluated and 
depends only upon known guantities. Specifically, the red channel of the raw 
image data {r { and r 2 ), the gradient of the red channel and |^J, a scale 
parameter ( w ), the initial surfel (P,-, xt, and yt), the best shifts [u and v ), and 
the camera parameters (/, x 0 , y 0 , x^, yZ, z£, and CJ. 

The derivation for^ is nearly identical. a7 ^ ( ^ ) istheonly significant differ¬ 
ence, 


dH(0, <j>) 

d(j) 

(1 - 0x) S in0 
q x q y sin </> - q z cos (j) 

_ 9x^(1 - sin J) + q y cos (j) 


q x q y sin (j) + q z cos (f> 

(1 - q y ) sin (f) 
q y q z (l- sin j) - q x coscj) 


(B.34) 

q x q z sin J — q y cos (j) 
q y q z ( 1 - sin <j>) + q x cos <fi . 

(1 — q z ) sin (f> 
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Similar results apply for thegreen and blue channels. 



Bibliography 


[Adelson and Weiss, 1996] E. H. Adelson and Y. Weiss. A unified mixture 
framework for motion segmentation: Incorporating spatial coherence and 
estimating the number of models. In Computer Vision and Pattern Recog¬ 
nition (CVPR '96 Proceedings), pages 321-326, J une 1996. San Francisco, 
CA. 

[Ayer and Sawhney, 1995] S. Ayer and H. Sawhney. Layered representation of 
motion video using robust maximum-likelihood estimation of mixture mod¬ 
els and mdl encoding. I n International Conferenceon Computer Vision (ICCV 
'95 Proceedings), pages 777-784, J une 1995. Cambridge, M A. 

[Azarbayejani and Pentland, 1995] A. Azarbayejani and A. Pentland. Recur¬ 
sive estimation of motion, structure, and focal length. IEEE Transactions on 
Pattern Analysis and Machine!ntelIigence, 17(6):562-575, J une 1995. 

[Baker and Bolles, 1989] H. Harlyn Baker and Robert C. Bolles. Generalizing 
epi polar-plane image analysis on the spatiotemporal surface. I nternational 
J ournal of Computer Vision, 3(l):33-49, May 1989. 

[Becker and Bove, 1995] Shawn Becker and V. Michael Bove, J r. Semiauto¬ 
matic 3-D model extraction from uncalibrated 2-D camera views. In Visual 
Data Exploration and Analysis, volume 2410, pages 447-461. SPIE, Febru¬ 
ary 1995. San J ose, CA. 

[Belhumeur and Mumford, 1992] Peter N. Belhumeur and David Mumford. 
A bayesian treatment of the stereo correspondence problem using half- 
occluded regions. In Computer Vision and Pattern Recognition (CVPR 92 
Proceedings), pages 761-764, J une 1992. Champaign, IL. 

[Belhumeur, 1993] Peter N. Belhumeur. A binocular stereo algorithm for re¬ 
construction sloping, creased, and broken surfaces in the presence of half¬ 
occlusion. In I nternational Conference on Computer Vision (ICCV 93 Pro¬ 
ceedings), pages 534-539, May 1993. Berlin, Germany. 

[Belhumeur, 1996] Peter N. Belhumeur. A bayesian approach to binocular 
stereopsis. I nternational J ournal of Computer Vision, 19(3):237-262, 1996. 


127 



128 


BIBLIOGRAPHY 


[Bolles etal., 1987] Robert C. Bolles, H. Harlyn Baker, and David H. Mari- 
mont. Epi polar-plane image analysis: An approach to determining structure 
from motion. I international J ournal of Computer Vision, l(l):7-55, 1987. 

[Chou and Teller, 1998] G. T. Chou and S. Teller. Multi-level 3d reconstruc¬ 
tion with visibility constraints. I n I mage Understanding Workshop (IUW '98 
Proceedings), volume 2, pages 543-550, November 1998. Monterey, CA. 

[Collins etal., 1995] R. Collins, C. J aynes, F. Stolle, X. Wang, Y. Cheng, A. Han¬ 
son, and E. Riseman. A system for automated site model acquisition. In 
Integrating Photogrammetric Techniques with Scene Analysis and Machine 
Vision II, volume 7617. SPIE, April 1995. Orlando, FL. 

[Collins, 1996] Robert T. Collins. A space-sweep approach to true multi-image 
matching. I n Computer Vision and Pattern Recognition (CVPR '96 Proceed¬ 
ings), pages 358-363, J une 1996. San Francisco, CA. 

[Coorg etal., 1998] Satyan Coorg, Neel Master, and Seth Teller. Acquisition of 
a large pose-mosaic dataset. In Computer Vision and Pattern Recognition 
(CVPR '98 Proceedings), pages 872-878, J une 1998. Santa Barbara, CA. 

[Coorg, 1998] Satyan Coorg. Pose Imagery and Automated 3-D Modeling of 
Urban Environments. PhD thesis, MIT, 1998. 

[Cox etal., 1996] Ingemar J. Cox, Sunita L. Hingorani, Satish B. Rao, and 
Bruce M. Maggs. A maximum likelihood stereo algorithm. Computer Vi¬ 
sion and ImageUnderstanding, 63(3):542-567, May 1996. 

[Cox, 1994] IngemarJ. Cox. A maximum likelihood n-camera stereoalgorithm. 
In Computer Vision and Pattern Recognition (CVPR '94 Proceedings), pages 
733-739, J une 1994. Seattle, WA. 

[Cutler, 1999] Barbara M. Cutler. Aggregating building fragments generated 
from geo-referenced imagery into urban models. Master's thesis, MIT, 1999. 

[Dana et al., 1996] Kristin J. Dana, Bram van Ginneken, Shree K. Nayar, and 
J an J. Koenderink. Reflectance and texture of real-world surfaces. Technical 
Report CUCS-048-96, Columbia University, December 1996. 

[DeCouto, 1998] Douglas S. J. De Couto. Instrumentation for rapidly acquir¬ 
ing pose-imagery. Master's thesis, MIT, 1998. 

[Debevec etal., 1996] Paul E. Debevec, CamilloJ .Taylor, andj itendra Malik. 
Modeling and rendering architecture from photographs: A hybrid geometry- 
and image-based approach. In Computer Graphics (SI GGRAPH '96 Proceed¬ 
ings), pages 11-20, August 1996. 



BIBLIOGRAPHY 


129 


[Duda and Hart, 1973] Richard O. Duda and Peter E. Hart. Pattern Classifi¬ 
cation and Scene Analysis. J ohn Wiley & Sons, New York, NY, 1973. 

[Faugeras and Robert, 1994] Olivier Faugeras and Luc Robert. What can two 
images tell us about a third one? I n European Conference on Computer Vi¬ 
sion (ECCV '94Proceedings), pages485-492, May 1994. Stockholm, Sweden. 

[Faugeras, 1992] Olivier Faugeras. What can be seen in three dimensions with 
an uncalibrated stereo rig? In European Conference on Computer Vision 
(ECCV '92 Proceedings), pages 563-578, May 1992. Santa Margherita Lig- 
ure, Italy. 

[Faugeras, 1993] Olivier Faugeras. Three-Dimensional Computer Vision. MIT 
Press, Cambridge, MA, 1993. 

[Fua and Leclerc, 1995] P. Fua and Y. G. Leclerc. Object-centered surface re¬ 
construction: Combining multi-image stereo and shading. International 
J ournal of Computer Vision, 16(l):35-56, September 1995. 

[Fua and Leclerc, 1996] P. Fua and Y.G. Leclerc. Taking advantage of image- 
based and geometry-based constraints to recover 3-D surfaces. Computer 
Vision and I mage U nderstandi ng, 64(1):111-127, J uly 1996. 

[Fua, 1995] P. Fua. Reconstructing complex surfaces from multiple stereo 
views. In I nternational Conference on Computer Vision (ICCV 95 Proceed¬ 
ings), pages 1078-1085, J une 1995. Cambridge, MA. 

[Gering and Wells, 1999] David T. Gering and William M. Wells, III. Object 
modeling using tomography and photography. In IEEE Workshop on Multi¬ 
view Modeling and Analysis of Visual Scenes, pages 11-18, J une 1999. Fort 
Collins, CO. 

[Greeve, 1996] C. W. Greeve, editor. Digital Photogrammetry: an Addendum to 
the Manual of Photogrammetry. American Society of Photogrammetry and 
Remote Sensing, Falls Church, VA, 1996. 

[Hartley eta/., 1992] Richard Hartley, Rajiv Gupta, and Tom Chang. Stereo 
from uncalibrated cameras. In Computer Vision and Pattern Recognition 
(CVPR 92 Proceedings), pages 761-764, J une 1992. Champaign, IL. 

[Hartley, 1992] Richard Hartley. Estimation of relative camera positions for 
uncalibrated cameras. In European Conference on Computer Vision (ECCV 
92Proceedings), pages 579-587, May 1992. Santa Margherita Ligure, Italy. 

[Horn, 1986] Berthold Klaus Paul Horn. Robot Vision. MIT Press, Cambridge, 
MA, 1986. 



130 


BIBLIOGRAPHY 


[Horn, 1987] Berthold Klaus Paul Horn. Closed-form solution of absolute ori¬ 
entation using unit quaternions. J ournal of the Optical Society of America, 
4(4):629-642, April 1987. 

[Irani etal., 1997] M. Irani, B. Rousso, and S. Peleg. Recovery of ego-motion 
using region alignment. IEEE Transactions on Pattern Analysis and Ma¬ 
chine IntdIigence, 19(3):268-272, March 1997. 

[Kanadeand Okutomi, 1994] TakeoKanadeand Masatoshi Okutomi. A stereo 
matching algorithm with an adaptive window: Theory and experiment. 
IEEE Transactions on Pattern Analysis and Machine! ntd I igence, 16(9):920- 
932, September 1994. 

[Kang and Szeliski, 1996] Sing Bing Kang and Richard Szeliski. 3-D scene re¬ 
covery using omnidirectional multi baseline stereo. In Computer Vision and 
Pattern Recognition (CVPR '96 Proceedings), pages 364-370, J une 1996. San 
Francisco, CA. 

[Kang etal., 1995] Sing Bing Kang, J on A. Webb, C. Lawrence Zitnick, and 
Takeo Kanade. An active multi baseline stereo system with active illumina¬ 
tion and realtime image acquisition. In International Conference on Com¬ 
puter Vision (ICCV '95 Proceedings), pages 88-93, J une 1995. Cambridge, 
MA. 

[Kutulakos and Seitz, 1998a] Kiriakos N. Kutulakos and Steven M. Seitz. A 
theory of shape by space carving. CS Technical Report 692, University of 
Rochester, M ay 1998. 

[Kutulakos and Seitz, 1998b] Kiriakos N. Kutulakos and Steven M. Seitz. 
What do n photographs tell us about 3D shape? CS Technical Report 680, 
U niversity of Rochester, J anuary 1998. 

[Lenz and Tsai, 1988] Reimar K. Lenz and Roger Y. Tsai. Techniques for cali¬ 
bration of the scale factor and image center for high accuracy 3-D machine 
vision metrology. IEEE Transactions on Pattern Analysis and Machine In¬ 
telligence, 10(5):713-720, September 1988. 

[Li and Chen, 1995] Xiaoping Li and Tongwen Chen. Optimal £, approxi¬ 
mation of the gaussian kernel with application to scale-space construc¬ 
tion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 
17(10):1015-1019, October 1995. 

[Longuet-Higgins, 1981] H. C. Longuet-Higgins. A computer algorithm for re¬ 
constructing a scene from two projections. Nature, 293:133-135, September 
1981. 



BIBLIOGRAPHY 


131 


[Marr and Poggio, 1979] D. Marr andT. Poggio. A computational theory of hu¬ 
man stereo vision. Proceedings of the Royal Society of London, B(204):301- 
328, 1979. 

[Mayhew and Frisby, 1991] J ohn E. W. Mayhew and J ohn P. Frisby, editors. 
3D Model Recognition from Stereoscopic Cues. MIT Press, Cambridge, MA, 
1991. 

[Mohr and Arbogast, 1991] R. Mohr and E. Arbogast. It can be done without 
camera calibration. Pattern Recognition Letters, 12(l):39-43, J anuary 1991. 

[Mohr et al., 1993] R. Mohr, F. Veil Ion, and L. Quan. Relative 3D reconstruc¬ 
tion using multiple uncalibrated images. In Computer Vision and Pattern 
Recognition (CVPR '93 Proceedings), pages 543-548, J une 1993. New York, 
NY. 

[Nayar et al., 1997] Shree K. Nayar, Xi-Sheng Fang, and Terrance Boult. Sep¬ 
aration of reflection components using color and polarization. International 
J ournal of Computer Vision, 21(3):163-186, February 1997. 

[Ohta and Kanade, 1985] Y. Ohta and T. Kanade. Stereo by two-level dynamc 
programming. IEEE Transactions on Pattern Analysis and Machine I ntel I i- 
gence, 7(2):139-154, April 1985. 

[Okutomi and Kanade, 1993] Masatoshi Okutomi and Takeo Kanade. A 
multiple-baseline stereo. IEEE Transactions on Pattern Analysis and Ma¬ 
chine I ntdiigence, 15(4):353-363, April 1993. 

[Oren and Nayar, 1995] Michael Oren and Shree K. Nayar. Generalization of 
the lambertian model and implications for machine vision. I nternational 
J ournal of Computer Vision, 14(3):227-251, April 1995. 

[Pollard daL, 1985] S. B. Pollard, J. E. W. Mayhew, and J. P. Frisby. Pmf: A 
stereo correspondence algorithm using a disparity gradient constraint. Per¬ 
ception, 14:449-470, 1985. 

[Press etal., 1992] William H. Press, Saul A. Teukolsky, William T. Vetterling, 
and Brian P. Flannery. Numerical Recipes in C: The Art of Scientific Com¬ 
puting. Cambridge University Press, 1992. 

[Ramachandran and Lakshminarayanan, 1971] G. N. Ramachandran and 
A. V. Lakshminarayanan. Three dimensional reconstruction from radio¬ 
graphs and electron micrographs: Applications of convolutions instead 
of fourier transforms. Proceedings of the National Academy of Science, 
68:2236-2240, 1971. 



132 


BIBLIOGRAPHY 


[Seales and Faugeras, 1995] W. Brent Seales and Olivier D. Faugeras. Build¬ 
ing three-dimensional object modelsfrom image sequences. Computer Vision 
and I mage Understanding, 61(3):308-324, May 1995. 

[Seitz and Dyer, 1997] Steven M. Seitz and Charles R. Dyer. Photorealistic 
scene reconstruction by voxel coloring. In Computer Vision and Pattern 
Recognition (CVPR '97 Proceedings), pages 1067-1073, 1997. Puerto Rico. 

[Shum etal., 1998] Fleung-YeungShum, Mei Flan, and Richard Szeliski. Inter¬ 
active construction of 3D models from panoramic mosaics. I n Computer Vi¬ 
sion and Pattern Recognition (CVPR '98 Proceedings), pages 427-433, J une 
1998. Santa Barbara, CA. 

[Slama deal., 1980] Chester C. Slama, CharlesTheurer, and Soren W. Flenrik- 
sen, editors. Manual of Photogrammetry. American Society of Photogram- 
metry and Remote Sensing, Falls Church, VA, 4th edition, 1980. 

[Szeliski and Kang, 1994] Richard Szeliski and Sing Bing Kang. Recovering 
3D shape and motion from image streams using non-linear least squares. 
Journal of Visual Communications and I mage Representation, 5(1):10- 28, 
1994. 

[Szeliski and Shum, 1997] Richard Szeliski and Flarry Shum. Creating full- 
view panoramic mosaics and texture-mapped 3D models. In Computer 
Graphics (5IGGRAPH '97 Proceedings), pages 251-258, August 1997. 

[Szeliski andTonnesen, 1992] Richard Szeliski and David Tonnesen. Surface 
modeling with oriented particle systems. In Computer Graphics (SIG- 
GRAPH 92 Proceedings), volume 26, pages 185-194, J uly 1992. 

[Szeliski, 1996] Richard Szeliski. Video mosaics for virtual environments. 
IEEE Computer Graphics and Applications, 16(2):22- 30, March 1996. 

[Tagare and deFiguei redo, 1993] Flemant D. Tagare and Rui J. P. de- 
Figueiredo. A framework for the construction of reflectance maps for ma¬ 
chine vision. Computer Vision, Graphics And I mage Processing: ImageUn- 
derstanding, 57(3):265-282, May 1993. 

[Teller, 1998a] Seth Teller. Automated urban model acquisition: Project ratio¬ 
nale and status. I n Image Understanding Workshop (IUW 98 Proceedings), 
volume 2, pages 455-462, November 1998. Monterey, CA. 

[Teller, 1998b] Seth Teller. Toward urban model acquisition from geo-located 
images. In Pacific Graphics 98, 1998. Singapore. 



BIBLIOGRAPHY 


133 


[Tomasi and Kanade, 1992] Carlo Tomasi and Takeo Kanade. Shape and mo¬ 
tion from image streams under orthography: A factorization method. Inter¬ 
national J ournal of Computer Vision, 9(2):137-154, November 1992. 

[Tsai, 1983] Roger Y. Tsai. Multi frame image point matching and 3-D surface 
reconstruction. IEEE Transactions on Pattern Analysis and Machine Intelli¬ 
gence, 5(2): 159-174, March 1983. 

[Tsai, 1987] Roger Y. Tsai. A versatile camera calibration technique for high- 
accuracy three dimensional machine vision metrology using off-the-shelf 
TV cameras and lenses. IEEE Journal of Robotics and Automation, RA- 
3(4):323-344, August 1987. 

[Ullman, 1978] Shimon Ullman. The interpretation of structure from motion. 
A.I. Memo476, MIT, October 1978. 

[Ullman, 1979] Shimon Ullman. The interpretation of structure from motion. 
Proceedings of the Royal Socidty of London, B(203):405-426, 1979. 

[Vexcel, 1997] Vexcel. Thefotog family of products and services, 1997. Avail¬ 
able at http://www.vexcel.com/fotog/product.html. 

[Vora eta/., 1997a] Poorvi L. Vora, Joyce E. Farrell, Jerome D. Tietz, and 
David H. Brainard. Digital color cameras - response models. Technical Re¬ 
port HPL-97-53, Hewlett-Packard Laboratories, March 1997. 

[Vora eta/., 1997b] Poorvi L. Vora, Joyce E. Farrell, Jerome D. Tietz, and 
David H. Brainard. Digital color cameras - spectral response. Technical 
Report HPL-97-54, Hewlett-Packard Laboratories, March 1997. 

[Watkins, 1991] D. Watkins. Fundamentals of Matrix Computations. J ohn Wi¬ 
ley and Sons, I nc., 1991. 

[Wolf, 1974] Paul R. Wolf. Elements of Photogrammdtry. McGraw-Hill, New 
York, NY, 1974. 

[Wolff and Andreou, 1995] Lawrence B. Wolff and Andreas G. Andreou. Polar¬ 
ization camera sensors. I mage and Vision Computing, 13(6):497-510, Au¬ 
gust 1995. 

[Wolff, 1997] Lawrence B. Wolff. Polarization vision: A new sensory approach 
to image understanding. I mage and Vision Computing, 15(2):81-93, Febru¬ 
ary 1997. 

[Yachida, 1986] M. Yachida. 3D data acquisition by multiple views. In O. D. 
Faugeras and G. Giralt, editors, Robotics Research: theThird International 
Symposium, pages 11-18. MIT Press, Cambridge, MA, 1986. 



